Abstract

We introduce three differentially private algorithms that approximate the 2nd-moment matrix of
the data. These algorithms, which in contrast to existing algorithms output positive-definite matrices,
correspond to existing techniques in the linear regression literature. Specifically, we discuss the following
three techniques. (i) For Ridge Regression, we propose setting the regularization coefficient so that
approximating the solution using the Johnson-Lindenstrauss transform preserves privacy. (ii) We show
that adding a small batch of random samples to our data preserves differential privacy. (iii) We show that
sampling the 2nd-moment matrix from a Bayesian posterior inverse-Wishart distribution is differentially
private provided the prior is set correctly. We also evaluate our techniques experimentally and compare
them to the existing "Analyze Gauss" algorithm of Dwork et al. [DTTZ14].


1 Introduction 

Differentially private algorithms [DMNS06, DKM+06] are data analysis algorithms that give a strong
guarantee of privacy, roughly stated as: by adding to or removing from the data a single datapoint we do not
significantly change the probability of any outcome of the algorithm. The focus of this paper is on
differentially private approximations of the 2nd-moment matrix of the data, and on the uses of such
approximations in linear regression. Given a dataset $D \in \mathbb{R}^{n\times d}$, its 2nd-moment matrix
(also referred to as the Gram matrix of the data, or the scatter matrix if the mean of $D$ is 0) is the
matrix $D^T D$. Indeed, since the 2nd-moment matrix of the data plays a major role in many data-analysis
techniques, we already have differentially private algorithms that approximate the 2nd-moment matrix
[DTTZ14] for the purpose of approximating the PCA, techniques for approximating the rank-$k$ PCA of the
data directly [HR12, Har13, KT13], and differentially private algorithms for linear regression
[CMS11, KST12, TS13, BST14].

However, existing techniques for differentially private linear regression suffer from the drawback that they
approximate a single regression. That is, they assume that each datapoint is composed of a vector of features
$x$ and a label $y$, and find the best linear combination of the features that predicts $y$. Yet, given a dataset
$D$ with $d$ attributes we are free to pick any single attribute as a label, and any subset of the remaining
attributes as features. Therefore, a database with $d$ attributes yields $\exp(d)$ potential linear regression
problems, and running these algorithms for each linear regression problem separately simply introduces far
too much random noise.^1

In contrast, the differentially private techniques that approximate the 2nd-moment matrix of the data,
such as the Analyze Gauss paper of Dwork et al. [DTTZ14], allow us to run as many regressions on the data
as we want. Yet, to the best of our knowledge, they have never been analyzed for the purpose of linear
regression. Furthermore, the Analyze Gauss algorithm suffers from the drawback that it does not necessarily
output a positive-definite matrix. This, as discussed in [XKI11] and as we show in our experiments, can be
very detrimental, even if we do project the output back onto the set of positive definite matrices. And
though the focus of this work is on linear regression, one can postulate additional reasons why releasing a
positive definite matrix is of importance, such as using the output as a kernel matrix or doing statistical
inference on top of the linear regression.

^1 Indeed, Ullman [Ull15] has devised a solution to this problem, but this solution works in the more cumbersome online
model and requires exponential running time for the curator, whereas our techniques follow the more efficient offline approach.

Our Contribution. In this work, we give three differentially private techniques for approximating the
2nd-moment matrix of the data that output a positive-definite matrix. We analyze their utility, both
theoretically and empirically, and, more importantly, show how they correspond to existing techniques in
linear regression. And so we contribute to a growing line of works [BBDS12, VZ15, WFS15] that shows
that differential privacy may arise from existing techniques, provided parameters are set properly. We also
compare our algorithms to the existing Analyze Gauss technique.

(Some notation before we introduce our techniques. We assume the data is a matrix $A \in \mathbb{R}^{n\times d}$ with $n$
sample points in $d$ dimensions. For ease of exposition, we focus on a single regression problem, given by
$A = [X; y]$, i.e., the label is the $d$-th column and the features are the remaining $p = d - 1$ columns. We
use $\sigma_{\min}(A)$ to denote the least singular value of $A$.)

1. The Johnson-Lindenstrauss Transform and Ridge Regression. Blocki et al. [BBDS12] have shown
that projecting the data using a Gaussian Johnson-Lindenstrauss transform preserves privacy if $\sigma_{\min}(A)$
is sufficiently large, and this has been applied to linear regression [Upa14]. Our first result improves on the
analysis of Blocki et al. and uses a smaller bound on $\sigma_{\min}(A)$ (shaving off a factor of $\log(r)$, with $r$ denoting
the number of rows in the JL transform). This result implies that when $\sigma_{\min}(A)$ is large we can project
the data using the JL-transform, output the 2nd-moment matrix of the projected data, and preserve
privacy. Furthermore, it is also known [Sar06] that the JL-transform gives a good approximation for linear
regression problems. However, this is somewhat contradictory to our intuition: for datasets where $y$ is
well approximated by a linear combination of $X$, the least singular value should be small (as $A$'s stretch
along the direction $(\beta, -1)^T$ is small). That is why we artificially increase the singular values of $A$ by
appending to it the matrix $w \cdot I_{d\times d}$. It turns out that this corresponds to approximating the solution of the
Ridge regression problem [Tik63, HK70], the linear regression problem with $l_2$-regularization: the problem
of finding $\beta^R = \arg\min_\beta \sum_i (y_i - \beta \cdot x_i)^2 + w^2\|\beta\|^2$. The literature suggests many approaches [HTF09] to
determining the penalty coefficient $w^2$, approaches that are based on the data itself and on minimizing risk.
Here we give a fundamentally different approach: set $w$ so as to preserve $(\epsilon, \delta)$-differential privacy. Details,
utility analysis and experiments regarding this approach appear in Section 3.

2. Additive Wishart noise. Whereas the Analyze Gauss algorithm adds Gaussian noise to $A^T A$, here we
show that we can sample a positive definite matrix $W$ from a suitably chosen Wishart distribution $\mathcal{W}_d(V, k)$,
and output $A^T A + W$. This in turn corresponds to appending to $A$ a batch of $k$ i.i.d. samples from a multivariate
Gaussian $\mathcal{N}(0_d, V)$. One can view this too as an extension of Ridge regression, where instead of
appending $d$ fixed examples to $A$, we append $k \approx d + O(1/\epsilon^2)$ random examples.^2 Note that, as
opposed to Analyze Gauss [DTTZ14], where the noise has 0-mean, here the expected value of the noise
is $kV$. This yields a useful way of post-processing the output: $A^T A + W - kV$. Details, theorems and
experiments with additive Wishart noise appear in Section 4.

3. Sampling from an inverse-Wishart distribution. The Bayesian approach for estimating the 2nd-moment
matrix of the data assumes that the $n$ sample points are sampled i.i.d. from some $\mathcal{N}(0_d, V)$ for
some unknown $V$, where we have a prior distribution on $V$. Each sample point causes us to update our
belief on $V$, which results in a posterior distribution on $V$. Though often one just outputs the MAP of
the posterior belief (the mode of the posterior distribution), it is also common to output a sample drawn
randomly from the posterior distribution. We show that if one uses the inverse-Wishart distribution as
a prior (which is common, as the inverse-Wishart distribution is a conjugate prior), then sampling from
the posterior is $(\epsilon, \delta)$-differentially private, provided the prior is spread enough. This gives rise to our third
approach of approximating $A^T A$: sampling from a suitable inverse-Wishart distribution. We comment
that the idea that existing techniques in Bayesian analysis, and specifically sampling from the posterior
distribution, are differentially private on their own was originally introduced in the beautiful and elegant
work of Vadhan and Zheng [VZ15]. But whereas their work focuses on estimating the mean of the sample,
we focus on estimating the variances/2nd-moment. Details, theorems and experiments on sampling from the
inverse-Wishart distribution appear in Section 5.

^2 Though it is also tempting to think of this technique as running Bayesian regression with a random prior, this analogy does
not fully carry through, as we discuss later.

Finally, in a later section we compare our algorithms to the Analyze Gauss algorithm. We show that in the
simple case where the data is composed of $p$ independent features concatenated with a single linear combination
of the features, the Analyze Gauss algorithm, which introduces the least noise out of all algorithms, is clearly
the best algorithm once $n$ is sufficiently large. However, when the data contains multiple such regressions
and therefore has small singular values, the situation is far from clear cut, and indeed, unless $n$ is
extremely large, our algorithms achieve smaller errors than the Analyze Gauss baseline. We comment that
our experiments should be viewed solely as a proof-of-concept. They are only preliminary, and much more
experimentation is needed to fully evaluate the benefits of the various algorithms.


Our proof technique. Before continuing to preliminaries and the formal details of our algorithms, we
give an overview of the proof technique. (All of the proofs are deferred to Appendix A.) To prove that each
algorithm preserves $(\epsilon, \delta)$-differential privacy we state and prove 3 corresponding theorems, whose proofs
follow the same high-level approach. As mentioned above, one theorem improves on a theorem of Blocki et
al. [BBDS12], who were the first to show that the JL-transform is differentially private. Blocki et al. observed
that by projecting the data using an $(r \times n)$-matrix of i.i.d. normal Gaussians, we effectively repeat the same
one-dimensional projection $r$ independent times. So they proved that each one-dimensional projection is
$(\epsilon, \delta)$-differentially private, and to show the entire projection preserves privacy they used the off-the-shelf
composition of Dwork et al. [DRV10], getting a bound that depends on $O(\sqrt{r}\log(r))$. In order to derive a
bound depending only on $O(\sqrt{r})$, we do not use the composition theorem of [DRV10] but rather study the
specific $r$-fold composition of the projection. As a result, we cannot follow the approach of Blocki et al.

To show that a one-dimensional projection is $(\epsilon, \delta)$-differentially private, Blocki et al. compared the PDFs
of two multivariate Gaussians. The PDF of a multivariate Gaussian is given by the multiplication of two
terms: the first depends on the determinant of the variance, and the second depends on some exponent (see
the exact definition in Section 2). Blocki et al. compared the ratio of each of the terms and showed that w.h.p.
each term's ratio is suitably bounded. Unfortunately, following the same approach yields, for each of the
terms, a bound that grows with $r$, and an overall bound that depends on $O(r)$. Instead, we observe that
the contributions of the determinant term and the exponent term to the ratio of the PDFs are of opposite
signs. So we use the Matrix Determinant Lemma and the Sherman-Morrison Lemma (see Theorem A.4)
to combine both terms into a single exponent term, and bound its size using the Johnson-Lindenstrauss
transform (or rather, tight bounds on the $\chi^2$-distribution). The main lemma we use in our analysis is
Lemma A.1. This lemma, in addition to giving tight bounds for the Gaussian JL-transform
(mimicking the approach of Dasgupta and Gupta [DG03]), also gives a result that might be of independent
interest. The standard JL lemma shows that for an $(r \times d)$-matrix $R$ of i.i.d. normal Gaussians and any fixed
vector $v$ it holds w.h.p. that $v^T v \in (1 \pm \eta)\, v^T\left(\tfrac{1}{r} R^T R\right) v$ provided $r = O(\eta^{-2})$. In Lemma A.1 we also show
that for any fixed $v$ we have w.h.p. that $v^T v \in (1 \pm \eta)\, v^T\left(\tfrac{1}{r-d} R^T R\right)^{-1} v$ provided $r = d + O(\eta^{-2})$.^3
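The first of the two statements above (norm preservation under a Gaussian projection) is easy to check empirically. The following is a minimal sketch using only the Python standard library; the function name `jl_norm_ratio` and the choice of test vector are illustrative assumptions, not part of the paper.

```python
import random

def jl_norm_ratio(v, r, rng):
    """Return (1/r) * ||R v||^2 / ||v||^2 for an r x d matrix R of i.i.d. N(0, 1) entries."""
    sq_norm_v = sum(x * x for x in v)
    total = 0.0
    for _ in range(r):
        # One row of R has d i.i.d. N(0, 1) entries; only its inner product with v is needed.
        proj = sum(rng.gauss(0.0, 1.0) * x for x in v)
        total += proj * proj
    return total / (r * sq_norm_v)

if __name__ == "__main__":
    rng = random.Random(0)
    v = [1.0, -2.0, 0.5, 3.0]
    for r in (10, 100, 1000):
        # The ratio concentrates around 1 as r grows.
        print(r, jl_norm_ratio(v, r, rng))
```

The ratio is distributed as a $\chi^2_r$ variable divided by $r$, so its deviation from 1 shrinks roughly like $\sqrt{2/r}$.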


2 Preliminaries and Notation 

Notation. Throughout this paper, we use lower-case letters to denote scalars; bold characters to denote
vectors; and UPPER-case letters to denote matrices. The $l$-dimensional all-zero vector is denoted $0_l$, and
the $(l \times m)$-matrix of all zeros is denoted $0_{l\times m}$. The $l$-dimensional identity matrix is denoted $I_{l\times l}$. For two
matrices $M, N$ with the same number of rows we use $[M; N]$ to denote the concatenation of $M$ and $N$. We
use $\epsilon, \delta$ to denote the privacy parameters. For a given matrix $M$, $\|M\|$ denotes the spectral norm ($= \sigma_{\max}(M)$)
and $\|M\|_F$ denotes the Frobenius norm; we use $\sigma_{\max}(M)$ and $\sigma_{\min}(M)$ to denote its largest
and smallest singular value, respectively.

^3 To the best of our knowledge, for a general JLT, this is known to hold only when $r = O(d \cdot \eta^{-2})$ and the transform preserves
the lengths of all vectors in the space; see [Sar06], Corollary 11.

The Gaussian Distribution and Related Distributions. We denote by $Lap(\sigma)$ the Laplace
distribution whose mean is 0 and variance is $2\sigma^2$. A univariate Gaussian $\mathcal{N}(\mu, \sigma^2)$ denotes the Gaussian
distribution whose mean is $\mu$ and variance is $\sigma^2$. Standard concentration bounds on Gaussians give that
$\Pr[x > \mu + \sigma\sqrt{2\ln(1/\nu)}] < \nu$. A multivariate Gaussian $\mathcal{N}(\mu, \Sigma)$ for some positive semi-definite $\Sigma$ denotes
the multivariate Gaussian distribution where the mean of the $j$-th coordinate is $\mu_j$ and the covariance
between coordinates $j$ and $k$ is $\Sigma_{j,k}$. The PDF of such a Gaussian is defined only on the subspace $colspan(\Sigma)$,
where for every $x \in colspan(\Sigma)$ we have

$PDF(x) = \left((2\pi)^{rank(\Sigma)} \cdot \widetilde{\det}(\Sigma)\right)^{-1/2} \exp\left(-\tfrac{1}{2}(x - \mu)^T \Sigma^{\dagger} (x - \mu)\right)$

and $\widetilde{\det}(\Sigma)$ is the multiplication of all non-zero singular values of $\Sigma$. We will repeatedly use the rules
regarding linear operations on Gaussians. That is, for any scalar $c$, it holds that $c \cdot \mathcal{N}(\mu, \sigma^2) = \mathcal{N}(c \cdot \mu, c^2\sigma^2)$.
For any matrix $C$ it holds that $C \cdot \mathcal{N}(\mu, \Sigma) = \mathcal{N}(C\mu, C\Sigma C^T)$.

The $\chi^2_k$-distribution, where $k$ is referred to as the degrees of freedom of the distribution, is the distribution
of the squared $l_2$-norm of a vector of $k$ independent normal Gaussians. That is, given $X_1, \ldots, X_k \sim \mathcal{N}(0, 1)$
it holds that $\zeta = (X_1, X_2, \ldots, X_k) \sim \mathcal{N}(0_k, I_{k\times k})$, and $\|\zeta\|^2 \sim \chi^2_k$. Standard tail bounds on the
$\chi^2$-distribution give that for any $\nu \in (0, \tfrac{1}{2})$ we have $\Pr_{x \sim \chi^2_k}\left[x \in \left(\sqrt{k} \pm \sqrt{2\ln(2/\nu)}\right)^2\right] \geq 1 - \nu$. (We present
them in the appendix for completeness.) The Wishart distribution $\mathcal{W}_d(V, m)$ is the multivariate extension of
the $\chi^2$-distribution. It describes the scatter matrix of a sample of $m$ i.i.d. samples from a multivariate
Gaussian $\mathcal{N}(0_d, V)$, and so the support of the distribution is on positive definite matrices. For $m > d - 1$ we
have that $PDF_{\mathcal{W}_d(V,m)}(X) \propto \det(V)^{-m/2} \det(X)^{(m-d-1)/2} \exp\left(-\tfrac{1}{2}\mathrm{tr}(V^{-1}X)\right)$. The inverse-Wishart distribution
$\mathcal{W}_d^{-1}(V, m)$ describes the distribution over positive definite matrices whose inverse is sampled from the
Wishart distribution using the inverse of $V$; i.e. $X \sim \mathcal{W}_d^{-1}(V, m)$ iff $X^{-1} \sim \mathcal{W}_d(V^{-1}, m)$. For $m > d - 1$ it
holds that $PDF_{\mathcal{W}_d^{-1}(V,m)}(X) \propto \det(V)^{m/2} \det(X)^{-(m+d+1)/2} \exp\left(-\tfrac{1}{2}\mathrm{tr}(VX^{-1})\right)$.
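As a sanity check on the $\chi^2$ tail bound, read here as $\Pr[x \in (\sqrt{k} \pm \sqrt{2\ln(2/\nu)})^2] \geq 1 - \nu$, one can estimate the coverage of that interval by simulation. This is a minimal sketch; the helper names are hypothetical, and the lower endpoint is clamped at 0 when $\sqrt{k}$ is smaller than the slack term, which is an assumption made here for small $k$.

```python
import math
import random

def chi2_sample(k, rng):
    """Draw one chi^2_k sample as the squared norm of k independent N(0, 1) variables."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))

def tail_interval(k, nu):
    """Endpoints of (sqrt(k) -/+ sqrt(2 ln(2/nu)))^2; the lower one is clamped at 0."""
    slack = math.sqrt(2.0 * math.log(2.0 / nu))
    lo = max(0.0, math.sqrt(k) - slack) ** 2
    hi = (math.sqrt(k) + slack) ** 2
    return lo, hi

def coverage(k, nu, trials, rng):
    """Empirical fraction of chi^2_k samples that fall inside the interval."""
    lo, hi = tail_interval(k, nu)
    hits = sum(1 for _ in range(trials) if lo <= chi2_sample(k, rng) <= hi)
    return hits / trials

if __name__ == "__main__":
    print(coverage(20, 0.1, 2000, random.Random(0)))
```

In practice the interval is conservative, so the empirical coverage typically sits well above $1 - \nu$.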

Differential Privacy. In this work, we deal with input in the form of an $(n \times d)$-matrix with each row
bounded by an $l_2$-norm of $B$. Converting $A$ into a linear regression problem, we denote $A$ as the concatenation
of the $(n \times p)$-matrix $X$ with the vector $y \in \mathbb{R}^n$ ($A = [X; y]$) where $p = d - 1$. This implies we are trying to
predict $y$ as a linear combination of the columns of $X$. Two matrices $A$ and $A'$ are called neighbors if they
differ on a single row.

Definition 2.1 ([DMNS06, DKM+06]). An algorithm ALG which maps $(n \times d)$-matrices into some range
$\mathcal{R}$ is $(\epsilon, \delta)$-differentially private if for all pairs of neighboring inputs $A$ and $A'$ and all subsets $S \subseteq \mathcal{R}$ it holds
that $\Pr[ALG(A) \in S] \leq e^{\epsilon}\Pr[ALG(A') \in S] + \delta$. When $\delta = 0$ we say the algorithm is $\epsilon$-differentially private.

It was shown in [DMNS06] that for any $f$ where $\|f(A) - f(A')\|_1 \leq \Delta$, the algorithm that adds
Laplace noise $Lap(\Delta/\epsilon)$ to $f(A)$ is $\epsilon$-differentially private. It was shown in [DKM+06] that for any $f$ where
$\|f(A) - f(A')\|_2 \leq \Delta$, adding Gaussian noise $\mathcal{N}\left(0, \tfrac{2\Delta^2\ln(2/\delta)}{\epsilon^2}\right)$ to $f(A)$ is $(\epsilon, \delta)$-differentially private. This is
precisely the algorithm of Dwork et al. in their "Analyze Gauss" paper [DTTZ14]. They observed that in our
setting, for the function $f(A) = A^T A$ we have that $\|f(A) - f(A')\|_F^2 \leq B^4$. And so they add i.i.d. Gaussian
noise to each coordinate of $A^T A$ (forcing the noise to be symmetric, as $A^T A$ is symmetric). We therefore
refer to this benchmark as the Analyze Gauss algorithm. In addition, it is known that the composition of
two algorithms, each of which is $(\epsilon, \delta)$-differentially private, yields an algorithm which is $(2\epsilon, 2\delta)$-differentially
private.
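The Analyze Gauss benchmark described above can be sketched in a few lines. The sketch below only illustrates the mechanics (compute $A^T A$, then add one Gaussian draw per unordered coordinate pair so the noise matrix is symmetric). The noise scale `sigma` is taken as an input; calibrating it to $(\epsilon, \delta)$ and $B$ is deliberately left outside the sketch, and the function names are hypothetical.

```python
import random

def second_moment(A):
    """Return A^T A for a matrix A given as a list of rows."""
    d = len(A[0])
    return [[sum(row[i] * row[j] for row in A) for j in range(d)] for i in range(d)]

def analyze_gauss(A, sigma, rng):
    """Add symmetric i.i.d. Gaussian noise with standard deviation sigma to A^T A."""
    M = second_moment(A)
    d = len(M)
    for i in range(d):
        for j in range(i, d):
            z = rng.gauss(0.0, sigma)
            M[i][j] += z
            if j != i:
                M[j][i] += z  # mirror the same draw to keep the output symmetric
    return M

if __name__ == "__main__":
    A = [[1.0, 2.0], [0.5, -1.0], [0.0, 3.0]]
    print(analyze_gauss(A, 0.1, random.Random(0)))
```

Nothing in this construction forces the output to be positive definite, which is the drawback the three algorithms of this paper are designed to avoid.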
3 Ridge Regression: Set the Regularization Coefficient to Preserve Privacy

The standard problem of linear regression, finding $\hat\beta = \arg\min_\beta \|X\beta - y\|^2$, relies on the fact that $X$ is of
full rank. This clearly isn't always the case, and $X^T X$ may be singular or close to singular. To that end,
as well as for the purpose of preventing over-fitting, regularization is introduced. One way to regularize the
linear regression problem is to introduce an $l_2$-penalty term: finding $\beta^R = \arg\min_\beta \|X\beta - y\|^2 + w^2\|\beta\|^2$.
This is known as the Ridge regression problem, introduced by [Tik63, HK70] in the 60s and 70s. Ridge
regression has a closed form solution: $\beta^R = (X^T X + w^2 I_{p\times p})^{-1} X^T y$. The problem of setting $w$ has been
well studied [HTF09], where existing techniques are data-driven, often proposing to set $w$ so as to minimize the
risk of $\beta^R$. Here, we propose a fundamentally different approach to the problem of setting $w$: set it so that
we can satisfy $(\epsilon, \delta)$-differential privacy (via the Johnson-Lindenstrauss transform).
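The closed form can be checked numerically against ordinary least squares on an augmented dataset, namely $X$ with $w I_{p\times p}$ stacked underneath and $y$ padded with $p$ zeros (the reformulation the next paragraph spells out). The following is a minimal standard-library sketch; all function names are illustrative assumptions.

```python
def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda i: abs(aug[i][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for i in range(c + 1, n):
            f = aug[i][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[i][j] -= f * aug[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / aug[i][i]
    return x

def normal_equations(X, y):
    """Return (X^T X, X^T y) for a matrix X given as a list of rows."""
    p = len(X[0])
    G = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    c = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return G, c

def ridge_closed_form(X, y, w):
    """beta^R = (X^T X + w^2 I)^{-1} X^T y."""
    G, c = normal_equations(X, y)
    for i in range(len(G)):
        G[i][i] += w * w
    return solve(G, c)

def ridge_by_augmentation(X, y, w):
    """Ordinary least squares on [X; w I] and [y; 0], via its normal equations."""
    p = len(X[0])
    X_aug = [list(r) for r in X] + [[w if i == j else 0.0 for j in range(p)] for i in range(p)]
    y_aug = list(y) + [0.0] * p
    return solve(*normal_equations(X_aug, y_aug))

if __name__ == "__main__":
    X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 9.0]]
    y = [1.0, 2.0, 3.0, 5.0]
    print(ridge_closed_form(X, y, 0.5), ridge_by_augmentation(X, y, 0.5))
```

Both routes solve the same linear system, since the augmented Gram matrix equals $X^T X + w^2 I_{p\times p}$ and the augmented right-hand side equals $X^T y$.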

Observe, the Ridge regression problem can be written as: minimize $\|X\beta - y\|^2 + \|w I_{p\times p}\beta - 0_p\|^2$. So,
denote by $X'$ the $((n + p) \times p)$-matrix which we get by concatenating $X$ and $w I_{p\times p}$, and denote by $y'$
the concatenation of $y$ with $p$ zeros. Then $\beta^R = \arg\min_\beta \|X'\beta - y'\|^2$. Since $p = d - 1$ and we denote
$A = [X; y]$, we can in fact set $A'$ as the concatenation of $A$ with the $d$-dimensional matrix $w I_{d\times d}$, and we
have that

$f(\beta) = \left\|A' \binom{\beta}{-1}\right\|^2 = \|X'\beta - y'\|^2 + w^2$.

Hence $\beta^R = \arg\min_\beta f(\beta)$. Hence, an approximation
of $A'^T A'$ yields an approximation of the Ridge regression problem. One way to approximate $A'^T A'$ is via the
Johnson-Lindenstrauss transform, which is known to be differentially private if all the singular values of the
given input are sufficiently large [BBDS12]. And that is precisely why we use $A'$: all the singular values
of $A'^T A'$ are greater by $w^2$ than the singular values of $A^T A$, and in particular are always $\geq w^2$. Therefore,
applying the JLT to $A'$ gives an approximation of $A'^T A'$, and furthermore, due to the work of Sarlos [Sar06],
the JLT also approximates the linear regression. The following theorem improves on the original theorem of
Blocki et al. [BBDS12].

Theorem 3.1. Fix $\epsilon > 0$ and $\delta \in (0, \tfrac{1}{e})$. Fix $B > 0$. Fix a positive integer $r$ and let $w$ be such that
$w^2 = 4B^2\left(\sqrt{2r\ln(\tfrac{4}{\delta})} + \ln(\tfrac{4}{\delta})\right)/\epsilon$. Let $A$ be an $(n \times d)$-matrix with $d < r$ and where each row of $A$ has
bounded $l_2$-norm of $B$. Given that $\sigma_{\min}(A) \geq w$, the algorithm that picks an $(r \times n)$-matrix $R$ whose entries
are i.i.d. samples from a normal distribution $\mathcal{N}(0, 1)$ and publishes $R \cdot A$ is $(\epsilon, \delta)$-differentially private.

This gives rise to our first algorithm. Algorithm 1 gets as input the parameter $r$, the number of rows
in our JLT, and chooses the appropriate regularization coefficient $w$. Based on Theorem 3.1 and the
above-mentioned discussion, it is clear that Algorithm 1 is $(\epsilon, \delta)$-differentially private. Furthermore, based on the
work of Sarlos, we can also argue the following.

Theorem 3.2 ([Sar06], Theorem 12). Fix any $\eta > 0$ and $\nu \in (0, \tfrac{1}{2})$. Apply Algorithm 1 with
$r = O(d\log(d)\ln(1/\nu)/\eta^2)$. Then, w.p. $\geq 1 - \nu$ it holds that
$\|\beta^R - \tilde\beta^R\| \leq \frac{\eta}{\sigma_{\min}(A')}\sqrt{f(\beta^R)}$.

Existing results about the expected distance $E[\|\beta^R - \beta\|^2]$ (see [DFKU13]) can be used together with
Theorem 3.2 to give a bound on $\|\tilde\beta^R - \beta\|$.

In addition to Algorithm 1, we can use part of the privacy budget to look at the least singular value of
$A^T A$. If it happens to be the case that $\sigma_{\min}(A^T A)$ is large, then we can adjust $w$ by decreasing it by the
appropriate factor. In fact, one can completely invert the algorithm and, in case $\sigma_{\min}(A^T A)$ is really large,
not only set the regularization coefficient to be any arbitrary non-negative number, but also determine $r$
based on Theorem 3.1. Details appear in Algorithm 2.

To measure the effect of regularization we ran the following experiment. (Since the same experimental
setting is used in the following sections, we describe it here at length and refer to it in later sections.)
Input: A matrix $A \in \mathbb{R}^{n\times d}$ and a bound $B > 0$ on the $l_2$-norm of any row in $A$.
Privacy parameters: $\epsilon, \delta > 0$.
Parameter $r$ indicating the number of rows in the resulting matrix.

Set $w = \sqrt{4B^2\left(\sqrt{2r\ln(\tfrac{4}{\delta})} + \ln(\tfrac{4}{\delta})\right)/\epsilon}$.
Set $A'$ as the concatenation of $A$ with $w I_{d\times d}$.
Sample an $(r \times (n + d))$-matrix $R$ whose entries are i.i.d. samples from a normal Gaussian.
return $M = \tfrac{1}{r}(RA')^T(RA')$ and the approximation $\tilde\beta^R = \arg\min_{\beta:\ \beta_d = -1} \beta^T M \beta$.

Algorithm 1: Approximating Ridge Regression while Preserving Privacy
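The steps of Algorithm 1 can be sketched as follows. This is a hedged illustration and not the authors' implementation (their experiments were coded in R). It assumes the reading $w^2 = 4B^2(\sqrt{2r\ln(4/\delta)} + \ln(4/\delta))/\epsilon$ of Theorem 3.1, and it assumes that the constrained minimizer is obtained by solving $M_{11}\beta = m_{12}$, where $M_{11}$ is the top-left $p \times p$ block of $M$ and $m_{12}$ is the first $p$ entries of its last column. The names `regularizer`, `solve` and `jl_ridge` are hypothetical.

```python
import math
import random

def regularizer(B, r, eps, delta):
    """w under the assumed reading w^2 = 4 B^2 (sqrt(2 r ln(4/delta)) + ln(4/delta)) / eps."""
    log_term = math.log(4.0 / delta)
    return math.sqrt(4.0 * B * B * (math.sqrt(2.0 * r * log_term) + log_term) / eps)

def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda i: abs(aug[i][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for i in range(c + 1, n):
            f = aug[i][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[i][j] -= f * aug[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

def jl_ridge(A, B, r, eps, delta, rng):
    """Return M = (1/r) (R A')^T (R A') and the regression estimate derived from M."""
    d = len(A[0])
    w = regularizer(B, r, eps, delta)
    # A' is A with w * I_d stacked underneath.
    A_aug = [list(row) for row in A] + [[w if i == j else 0.0 for j in range(d)] for i in range(d)]
    M = [[0.0] * d for _ in range(d)]
    for _ in range(r):
        # One row of R A': a Gaussian combination of the rows of A'.
        proj = [0.0] * d
        for row in A_aug:
            g = rng.gauss(0.0, 1.0)
            for j in range(d):
                proj[j] += g * row[j]
        for i in range(d):
            for j in range(d):
                M[i][j] += proj[i] * proj[j] / r
    p = d - 1
    M11 = [row[:p] for row in M[:p]]
    m12 = [M[i][p] for i in range(p)]
    return M, solve(M11, m12)
```

Generating $R$ one row at a time avoids storing the full $r \times (n + d)$ matrix, which matters for the larger values of $n$ used in the experiments.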


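Input: A matrix $A \in \mathbb{R}^{n\times d}$ and a bound $B > 0$ on the $l_2$-norm of any row in $A$.
Privacy parameters: $\epsilon, \delta > 0$.
Parameter $r_0$ indicating the minimal number of rows in the resulting matrix.

Set $w = \sqrt{\tfrac{8B^2}{\epsilon}\left(\sqrt{2r_0\ln(\tfrac{8}{\delta})} + \ln(\tfrac{8}{\delta})\right)}$.
Set $s \leftarrow \max\left\{0,\ \sigma_{\min}(A^T A) - \tfrac{2B^2\ln(2/\delta)}{\epsilon} + Z\right\}$ where $Z \sim Lap(\tfrac{2B^2}{\epsilon})$.
Adjust $w \leftarrow \sqrt{\max\{0, w^2 - s\}}$.
if $w > 0$ then
    Set $A'$ as the concatenation of $A$ with $w I_{d\times d}$.
    Sample an $(r_0 \times (n + d))$-matrix $R$ whose entries are i.i.d. samples from a normal Gaussian.
    return $M = \tfrac{1}{r_0}(RA')^T(RA')$ and the approximation $\tilde\beta^R = \arg\min_{\beta:\ \beta_d = -1} \beta^T M \beta$.
else
    Set $r^* \leftarrow \max\left\{r \in \mathbb{Z} :\ \tfrac{8B^2}{\epsilon}\left(\sqrt{2r\ln(\tfrac{8}{\delta})} + \ln(\tfrac{8}{\delta})\right) \leq s\right\}$.
    Sample an $(r^* \times n)$-matrix $R$ whose entries are i.i.d. samples from a normal Gaussian.
    return $M = \tfrac{1}{r^*}(RA)^T(RA)$ and the approximation $\tilde\beta = \arg\min_{\beta:\ \beta_d = -1} \beta^T M \beta$.
end

Algorithm 2: Approximating Regression (Ridge or standard) while Preserving Privacy.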
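Both algorithms return a minimizer of $\beta^T M \beta$ over vectors whose last coordinate is fixed to $-1$. For a symmetric $M$ with a positive definite top-left block this has a closed form, derived below. The block names $M_{11}$, $m_{12}$, $m_{22}$ are notation introduced here for the derivation only.

```latex
\[
M = \begin{pmatrix} M_{11} & m_{12} \\ m_{12}^T & m_{22} \end{pmatrix},
\qquad
\begin{pmatrix}\beta \\ -1\end{pmatrix}^{T} M \begin{pmatrix}\beta \\ -1\end{pmatrix}
= \beta^T M_{11}\beta - 2\beta^T m_{12} + m_{22}.
\]
\[
\nabla_\beta\left(\beta^T M_{11}\beta - 2\beta^T m_{12} + m_{22}\right)
= 2M_{11}\beta - 2m_{12} = 0
\;\Longrightarrow\;
\tilde\beta = M_{11}^{-1} m_{12}.
\]
\[
M = A^T A \;\Rightarrow\; \tilde\beta = (X^T X)^{-1}X^T y,
\qquad
M = A'^T A' \;\Rightarrow\; \tilde\beta = (X^T X + w^2 I_{p\times p})^{-1}X^T y .
\]
```

So with the exact second-moment matrix the constrained problem recovers ordinary least squares, and with the regularized matrix $A'^T A'$ it recovers the Ridge solution.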


3.1 The Basic Single-Regression Experiment — Setting 

To compare the various algorithms we introduce and to analyze their utility, we ran experiments
testing their performance over data generated from a multivariate Gaussian. The experiments all share the
same common setting, but each experiment studied a different set of estimators. In this section we detail
the common setting, and in the next one we detail the specific estimators and results of each experiment
separately.

We pick $p = 20$ i.i.d. features sampled from a normal Gaussian, pick some $\beta \in_R [-1, 1]^{p+1}$ (the
last coordinate denotes the regression's intercept), and set $y$ as the linear combination of the features and
the intercept (the all-1 column) plus random noise sampled from $\mathcal{N}(0, 0.5)$. Hence our data has dimension
$d = p + 2 = 22$ and the 21-dimensional vector $\beta$ has $l_2$-norm of about 3. We vary $n$ over the values
$\{2^{12} = 4{,}096,\ 2^{13},\ 2^{14},\ \ldots,\ 2^{25} = 33{,}554{,}432\}$. We vary $\epsilon$ over the values $\{0.05, 0.1, 0.15, 0.2, 0.25, 0.5\}$,
fix $\delta = e^{-10}$,^4 and use the $l_2$-bound of $B = \sqrt{2.5d}$. (As preprocessing, each datapoint whose length is $> B$
is shrunk to have length $B$.) For each estimator we experimented with, we run it $t = 15$ times, and report
the mean and standard deviation of the 15 experiments. In all experiments we measure the $l_2$-distance
between the outputted estimator of each algorithm and the true $\beta$ we used to generate the data. After all,
the algorithms we give are aimed at learning the $\beta$ that generated the given samples, and so they should
return an estimator close to the true $\beta$. We coded all experiments in R and ran the experiments on a standard
laptop.

^4 We are aware that it is good standard practice to set $\delta < \tfrac{1}{n}$, since otherwise sampling from the data is $(\epsilon, \delta)$-differentially
private. However, as we vary $n$ drastically, we aim to keep all other parameters equal.
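The data-generation step of this setting can be sketched as follows. The paper's experiments were coded in R; this Python sketch is only an illustration of the described setup. It assumes that $\mathcal{N}(0, 0.5)$ denotes variance 0.5, in line with the $\mathcal{N}(\mu, \sigma^2)$ notation of Section 2, and the function name `generate_dataset` is hypothetical.

```python
import math
import random

def generate_dataset(n, p, beta, noise_var, rng):
    """Rows are (p features, all-1 intercept column, label), so each row has d = p + 2 entries."""
    d = p + 2
    bound = math.sqrt(2.5 * d)  # the l2-bound B = sqrt(2.5 d) from the text
    rows = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)] + [1.0]
        # Assumption: the second parameter of N(0, 0.5) is a variance.
        noise = rng.gauss(0.0, math.sqrt(noise_var))
        y = sum(b * xi for b, xi in zip(beta, x)) + noise
        row = x + [y]
        norm = math.sqrt(sum(v * v for v in row))
        if norm > bound:
            # Preprocessing: shrink any datapoint longer than B to length B.
            row = [v * bound / norm for v in row]
        rows.append(row)
    return rows, bound

if __name__ == "__main__":
    rng = random.Random(0)
    beta = [rng.uniform(-1.0, 1.0) for _ in range(21)]
    rows, bound = generate_dataset(4096, 20, beta, 0.5, rng)
    print(len(rows), len(rows[0]), bound)
```

Shrinking a row rescales its intercept entry as well, so clipped rows no longer carry an exact 1 in that column; this follows the preprocessing as literally described.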


3.2 Experiment on Ridge Regression — Measuring the Effect of Regularization 

To measure the effect of regularization we ran the following experiment in the setting detailed in Section 3.1.
For each choice of $\epsilon$ and $n$ we ran three predictors. The first one is based on Algorithm 2 with $r_0 = 2d$. The
second one is Algorithm 1, where we fed it the parameter $r$ that the first one used, just so all predictors will
be comparable. The last one is the non-private version that projected the data itself, without appending it
with the $w \cdot I_{d\times d}$ matrix (again, using the same parameter $r$ as the other two predictors). The results are
given in Figure 1.


[Figure 1 plots: six panels, one for each of epsilon = 0.5, 0.25, 0.2, 0.15, 0.1, 0.05.]

Figure 1: (best seen in color) A comparison of the average $l_2$-error of the JL-based estimators. Algorithm 2
in blue, Algorithm 1 in red, and the non-private version in black. The x-axis is the size of the data in
log-scale.


The results are strikingly similar across all values of $\epsilon$. Initially the error of the predictors is very high
(for the value of $\beta$ we used to generate the data, $\|\beta\| \approx 2.786$, so such levels of noise mean in fact zero utility).
Furthermore, it takes a while until Algorithm 2 (in blue) outperforms the more naive Algorithm 1 (in red).
(In most experiments, this happens only for the larger values of $n$.) This implies that the privacy budget
"wasted" on the private estimation of the least singular value of the data actually ends up reducing our
utility, but not by a large factor. Towards the largest value of $n$, Algorithm 2 actually does noticeably better
than Algorithm 1, by a multiplicative factor of $\approx 3.5$ to $\approx 11$ (for $n = 2^{25}$ when $\epsilon = 0.1$ we have mean accuracy
of 0.0192 vs. 0.0671; when $\epsilon = 0.5$ we have mean accuracy of 0.0058 vs. 0.0639). In all experiments, the
non-private estimator (in black) was clearly the best for all values of $n$.
4 Additive Wishart Noise: Regression with Additional Random Examples

As discussed in the previous section, Ridge regression can be viewed as regression where in addition to
the sample points given by $[X; y]$ we see $d$ additional datapoints given by $w I_{d\times d}$. Our second technique
follows this approach, only, instead of introducing these $d$ fixed datapoints, we introduce a few more than
$d$ datapoints which are random and independent of the data.^5 Formally, we give the details in Algorithm 3,
immediately followed by the theorem proving it is $(\epsilon, \delta)$-differentially private.


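Input: A matrix $A \in \mathbb{R}^{n\times d}$ and a bound $B > 0$ on the $l_2$-norm of any row in $A$.
Privacy parameters: $\epsilon, \delta > 0$.

Set $k \leftarrow \lfloor d + \tfrac{14}{\epsilon^2} \cdot 2\ln(4/\delta) \rfloor$.
Sample $v_1, v_2, \ldots, v_k$ i.i.d. examples from $\mathcal{N}(0_d, B^2 I_{d\times d})$.
return $M = A^T A + \sum_{i=1}^{k} v_i v_i^T$ and the approximation $\tilde\beta = \arg\min_{\beta:\ \beta_d = -1} \beta^T M \beta$.

Algorithm 3: Additive Wishart Noise Algorithm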


Theorem 4.1. Fix $\epsilon \in (0, 1)$ and $\delta \in (0, \tfrac{1}{e})$. Fix $B > 0$. Let $A$ be an $(n \times d)$-matrix where each row of $A$
has bounded $l_2$-norm of $B$. Let $N$ be a matrix sampled from the $d$-dimensional Wishart distribution with
$k$ degrees of freedom using the scale matrix $B^2 \cdot I_{d\times d}$ (i.e., $N \sim \mathcal{W}_d(B^2 \cdot I_{d\times d}, k)$) for $k \geq \lfloor d + \tfrac{14}{\epsilon^2} \cdot 2\ln(4/\delta) \rfloor$.
Then outputting $X = A^T A + N$ is $(\epsilon, \delta)$-differentially private.
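The noise-addition step of Algorithm 3 can be sketched with the standard library alone. In this illustration the number of random examples `k` is taken as an input, so that the sketch does not commit to a particular constant in the degrees-of-freedom formula; choosing `k` as Theorem 4.1 requires is left to the caller. The function name `additive_wishart` is hypothetical.

```python
import random

def additive_wishart(A, B, k, rng):
    """Return A^T A + sum_{i=1..k} v_i v_i^T, where each v_i ~ N(0_d, B^2 I_d)."""
    d = len(A[0])
    M = [[sum(row[i] * row[j] for row in A) for j in range(d)] for i in range(d)]
    for _ in range(k):
        # rng.gauss takes a standard deviation, so B here gives variance B^2.
        v = [rng.gauss(0.0, B) for _ in range(d)]
        for i in range(d):
            for j in range(d):
                M[i][j] += v[i] * v[j]
    return M

if __name__ == "__main__":
    A = [[1.0, 2.0], [0.0, 1.0]]
    print(additive_wishart(A, 1.0, 10, random.Random(0)))
```

Because each added term $v_i v_i^T$ is positive semi-definite, the output is at least as large as $A^T A$ in the positive semi-definite order, which is what keeps it positive definite once $k \geq d$.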

Note: Ridge Regression also has a Bayesian interpretation, as introducing a prior on $\beta$ in the regression
problem. It is therefore tempting to argue that Theorem 4.1 implies that solving the regression problem
with a random prior preserves privacy. (I.e., output the MAP of $\beta$ after setting its prior to a random sample
from the Wishart distribution.) However, this analogy isn't fully accurate, since our algorithm also adds
random noise to $X^T y$. Indeed, regardless of what prior we use for $\beta$, if $y = 0_n$ then we always output $0_p$ as
the estimator of $\beta$, so one can differentiate between the case that $y = 0_n$ and the case that $y \neq 0_n$. We leave the (very
interesting) question of whether Wishart additive random noise can be interpreted as a Bayesian prior for
future work.

We give a bound on the utility of the estimator we get with this technique in Theorem C.3. However,
we are more interested in the utility of this approach after we remove some of the noise we add in this
technique. Note that $E[N] = kB^2 \cdot I_{d\times d}$, and so it stands to reason that we output $A^T A + N - kB^2 \cdot I_{d\times d}$.
Now, when $\sigma_{\min}(A^T A)$ is small, we run the risk that some of the eigenvalues of $A^T A + N$ are smaller
than $kB^2$, causing some of the eigenvalues of $A^T A + N - kB^2 \cdot I_{d\times d}$ to be negative (which means we no
longer output a PSD matrix). In such a case, Lemma A.3 assures us that w.h.p. we can decrease $A^T A + N$ by
$B^2\left(\sqrt{k} - (\sqrt{d} + \sqrt{2\ln(4/\delta)})\right)^2 \cdot I_{d\times d}$ and maintain the property that the output is a positive definite matrix.
This is the algorithm we set out to evaluate empirically.
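The two candidate shifts discussed above are simple to compute. The sketch below evaluates both, reading the smaller one as $B^2(\sqrt{k} - (\sqrt{d} + \sqrt{2\ln(4/\delta)}))^2$. Treating that shift as 0 when the inner difference is not positive is an assumption made here, since the lower bound on the noise is then vacuous. The function name `wishart_shifts` is hypothetical.

```python
import math

def wishart_shifts(d, k, B, delta):
    """Return (full_shift, safe_shift): the multiples of I_d one may subtract from A^T A + N."""
    full_shift = k * B * B  # E[N] = k B^2 I_d
    gap = math.sqrt(k) - (math.sqrt(d) + math.sqrt(2.0 * math.log(4.0 / delta)))
    # Assumption: if the gap is not positive the bound is vacuous, so subtract nothing.
    safe_shift = B * B * gap * gap if gap > 0 else 0.0
    return full_shift, safe_shift

if __name__ == "__main__":
    print(wishart_shifts(d=22, k=4000, B=math.sqrt(55.0), delta=math.exp(-10)))
```

The full shift removes the entire expected noise but may destroy positive definiteness when $\sigma_{\min}(A^T A)$ is small; the safe shift is always the smaller of the two.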


4.1 Experiment on Additive Wishart Noise 

To evaluate the utility of the additive random Wishart noise algorithm we implemented and ran the algorithm
in the same setting as detailed in Section 3.1. For each choice of $\epsilon$ and $n$ we ran three predictors. The first one
is the naive and non-private linear regression, which uses the data with no additive noise (i.e., $\hat\beta$). The second
one is given by Algorithm 3. The last one is the estimator we get using the output of Algorithm 3 minus
either $kB^2 \cdot I_{d\times d}$ or $B^2\left(\sqrt{k} - (\sqrt{d} + \sqrt{2\ln(4/\delta)})\right)^2 \cdot I_{d\times d}$ (whichever of the two we can use and maintain
positive definiteness). We repeat each experiment $t = 15$ times, measuring the $l_2$-distance between the
outputted estimator of each algorithm and the true $\beta$ we used to generate the data. (This yields randomness
in $\|\hat\beta - \beta\|$, since we re-sample the data every time.) We report the mean and standard deviation of the 15
experiments. The results are given in Figure 2. The results are again consistent across the board: reducing
the noise also reduces the error, and indeed the second estimator is consistently doing better than the naive
estimator.

^5 Independent of the data itself, but dependent on its properties. Our noise does depend on the $l_2$-bound $B$.



Figure 2: (best seen in color) A comparison of the average $l_2$-error of the Wishart additive noise estimators.
Algorithm 3 in blue, deducting the expected shift from the output of Algorithm 3 and then running the
regression is in red, and the non-private version in black. The x-axis is the size of the data in log-scale.


5 Sampling from an Inverse-Wishart Distribution (Bayesian Posterior)

In Bayesian statistics, one estimates the 2nd-moment matrix in question by starting with a prior and updating
it based on the examples in the data. More specifically, our dataset $A$ contains $n$ datapoints which we assume
to be drawn i.i.d. from some $\mathcal{N}(0_d, V)$. We assume $V$ was sampled from some distribution $\mathcal{D}$ over positive
definite matrices, which is the prior for $V$. We then update our belief over $V$ using the Bayesian formula:
$\Pr[V \mid A] = \frac{\Pr[A \mid V] \cdot \Pr[V]}{\Pr[A]}$. Finally, with the posterior belief we give an estimation of $V$: either by
outputting the posterior distribution itself, or by outputting the most likely $V$ according to the posterior,
or by sampling from this posterior distribution (maybe multiple times). In this work we assume that our
estimator of $V$ is given by sampling from the posterior distribution.

One of the most common priors used for positive definite matrices is the inverse-Wishart distribution.
This is mainly due to the fact that the inverse-Wishart distribution is a conjugate prior. (A family of
distributions is called a conjugate prior if the prior distribution and the posterior distribution both belong
to this family.) Specifically, if our prior belief is that $V \sim \mathcal{W}_d^{-1}(\Psi, k)$ then after viewing $n$ examples
in the dataset $A$ our posterior is $V \sim \mathcal{W}_d^{-1}(A^TA + \Psi,\, n + k)$. Here we show that sampling such a positive
definite matrix $V$ from our posterior inverse-Wishart distribution is $(\epsilon,\delta)$-differentially private, provided
the prior distribution's scale matrix, $\Psi$, has a sufficiently large $\sigma_{\min}(\Psi)$. This result is in line with the
recent beautiful work of Vadhan and Zheng [VZ15], who showed that many Bayesian techniques for estimating
the means are differentially private, provided the prior is set correctly. The formal description of our
algorithm and its privacy statement are given below.
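The conjugate update described above only touches the two parameters of the inverse-Wishart distribution: the scale matrix gains $A^TA$ and the degrees of freedom gain $n$. A minimal sketch of that bookkeeping follows; the helper names `gram` and `inverse_wishart_posterior` are illustrative and not taken from the paper, and no sampling is performed.

```python
def gram(A):
    """Return the Gram matrix A^T A for a matrix A given as a list of rows."""
    d = len(A[0])
    return [[sum(row[i] * row[j] for row in A) for j in range(d)] for i in range(d)]


def inverse_wishart_posterior(A, prior_scale, prior_dof):
    """Posterior (scale, degrees of freedom) after observing the n rows of A.

    Prior:     V ~ InvWishart(prior_scale, prior_dof)
    Posterior: V ~ InvWishart(A^T A + prior_scale, n + prior_dof)
    """
    G = gram(A)
    d = len(G)
    scale = [[G[i][j] + prior_scale[i][j] for j in range(d)] for i in range(d)]
    return scale, prior_dof + len(A)


# Toy example (hypothetical data): three 2-dimensional datapoints.
A = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
Psi = [[0.5, 0.0], [0.0, 0.5]]
scale, dof = inverse_wishart_posterior(A, Psi, prior_dof=4)
```

The privacy question studied in this section is then how large $\sigma_{\min}$ of the prior scale must be for a single draw from this posterior to be private.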


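Input: A matrix $A \in \mathbb{R}^{n\times d}$ and a bound $B > 0$ on the $\ell_2$-norm of any row in $A$.

Privacy parameters: $\epsilon, \delta > 0$.

Set $\psi \leftarrow \frac{2B^2}{\epsilon}\left(2\sqrt{2(n+d)\ln(4/\delta)} + 2\ln(4/\delta)\right)$.

Sample $M \sim \mathcal{W}_d^{-1}(A^TA + \psi\cdot I_{d\times d},\ n + d)$.

return $M$ and the approximation $\hat\beta = \arg\min_{\beta:\,\beta_d = -1} \beta^T M \beta$.

Algorithm 4: Sampling from an Inverse-Wishart Distribution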
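As a rough illustration of how the regularization in Algorithm 4 scales, the sketch below evaluates $\psi = \frac{2B^2}{\epsilon}\left(2\sqrt{2(n+d)\ln(4/\delta)} + 2\ln(4/\delta)\right)$, i.e. the quantity $w^2$ of Theorem 5.1 with $\nu = n + d$ degrees of freedom. This reading of the formula is an assumption reconstructed from Theorem 5.1, and the function names are hypothetical.

```python
import math


def theorem_5_1_w_squared(B, eps, delta, nu):
    """w^2 = 2 B^2 (2 sqrt(2 nu ln(4/delta)) + 2 ln(4/delta)) / eps (assumed reading)."""
    L = math.log(4.0 / delta)
    return 2.0 * B ** 2 * (2.0 * math.sqrt(2.0 * nu * L) + 2.0 * L) / eps


def algorithm4_psi(B, eps, delta, n, d):
    """Scale added to the diagonal of A^T A before sampling with n + d degrees of freedom."""
    return theorem_5_1_w_squared(B, eps, delta, nu=n + d)


# psi grows like sqrt(n): quadrupling n + d roughly doubles the dominant term.
psi_small = algorithm4_psi(B=1.0, eps=0.5, delta=1e-6, n=1000, d=10)
psi_large = algorithm4_psi(B=1.0, eps=0.5, delta=1e-6, n=4000, d=40)
```

Because $\psi$ depends on $n$, the added regularization does not vanish relative to the noise as the dataset grows.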


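Theorem 5.1. Fix $\epsilon > 0$ and $\delta \in (0, \frac{1}{e})$. Fix $B > 0$. Let $A$ be an $(n \times d)$-matrix and fix an integer $\nu \ge d$.
Let $w$ be such that $w^2 = 2B^2\left(2\sqrt{2\nu\ln(4/\delta)} + 2\ln(4/\delta)\right)/\epsilon$. Then, given that $\sigma_{\min}(A) \ge w$, the algorithm
that samples a matrix from $\mathcal{W}_d^{-1}(A^TA, \nu)$ is $(\epsilon,\delta)$-differentially private.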

We comment on the similarities between Theorem 5.1 and Theorem 3.1. Indeed, the JL-based algorithm
essentially samples a matrix from $\mathcal{W}_d(A^TA + w^2 I, k)$ for some choice of $w$ and $k$ (and then normalizes the
sample by $\frac{1}{k}$); and Algorithm 4 samples a matrix from $\mathcal{W}_d^{-1}(A^TA + w^2 I, k)$ for a very similar choice of $w$. In
fact, in the JL-based algorithm, instead of sampling $R$ and then multiplying it with $A$, we can sample a Gaussian
matrix $R$ of matching dimensions and multiply it with $(A^TA)^{1/2}$, or even sample an $(r \times d)$-matrix $R$ where each
of its rows is sampled i.i.d. from $\mathcal{N}(0_d, A^TA)$. (All of those have the same distribution over the output.) And so,
much like we did in the Johnson-Lindenstrauss case, we can also use part of the privacy budget to estimate
$\sigma_{\min}(A^TA)$ and then set the parameter $\psi$ accordingly. Details appear in Algorithm 5.


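Input: A matrix $A \in \mathbb{R}^{n\times d}$ and a bound $B > 0$ on the $\ell_2$-norm of any row in $A$.

Privacy parameters: $\epsilon, \delta > 0$.

A parameter $k_0$ indicating the minimal degrees of freedom.

Set $\psi \leftarrow \frac{4B^2}{\epsilon}\left(2\sqrt{2k_0\ln(8/\delta)} + 2\ln(8/\delta)\right)$.

Set $s \leftarrow \max\left\{0,\ \sigma_{\min}(A^TA) - \frac{4B^2}{\epsilon}\ln(2/\delta) + Z\right\}$ where $Z \sim \mathrm{Lap}\left(\frac{4B^2}{\epsilon}\right)$.

Adjust $\psi \leftarrow \max\{0, \psi - s\}$.

if $\psi > 0$ then

    Sample $M \sim \mathcal{W}_d^{-1}(A^TA + \psi\cdot I_{d\times d},\ k_0)$.

else

    Set $k^* \leftarrow \max\left\{k \in \mathbb{Z} : \frac{4B^2}{\epsilon}\left(2\sqrt{2k\ln(8/\delta)} + 2\ln(8/\delta)\right) \le s\right\}$.

    Sample $M \sim \mathcal{W}_d^{-1}(A^TA,\ k^*)$.

end

return $M$ and the approximation $\hat\beta = \arg\min_{\beta:\,\beta_d = -1} \beta^T M \beta$.

Algorithm 5: Sampling from an Inverse-Wishart Distribution whose degrees of freedom are determined
by the input.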
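The only non-obvious step in Algorithm 5 is choosing $k^*$, the largest integer $k$ for which $c\left(2\sqrt{2k\ln(8/\delta)} + 2\ln(8/\delta)\right) \le s$. A small sketch of that search is given below. The leading coefficient is garbled in the extracted text, so it is passed in as a free parameter `c` rather than fixed; the function name is hypothetical.

```python
import math


def max_degrees_of_freedom(s, c, delta):
    """Largest integer k >= 0 with c * (2*sqrt(2*k*L) + 2*L) <= s, where L = ln(8/delta).

    Returns None when even k = 0 violates the constraint.
    `c` stands for the algorithm's leading coefficient (an assumption of this sketch).
    """
    L = math.log(8.0 / delta)

    def fits(k):
        return c * (2.0 * math.sqrt(2.0 * k * L) + 2.0 * L) <= s

    if not fits(0):
        return None
    # Closed-form candidate: 8 k L <= (s/c - 2L)^2, then repair rounding errors.
    k = int((s / c - 2.0 * L) ** 2 / (8.0 * L))
    while not fits(k):
        k -= 1
    while fits(k + 1):
        k += 1
    return k


k_star = max_degrees_of_freedom(s=1000.0, c=1.0, delta=0.1)
```

A larger estimate $s$ of the least eigenvalue therefore buys more degrees of freedom, which in turn concentrates the sampled matrix more tightly.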

5.1 Experiments on Sampling from the Inverse Wishart Distribution

To estimate the utility of Algorithms 4 and 5 we conduct similar experiments to before, in the same setting
detailed in Section 3.1. For each choice of $\epsilon$ and $n$ we ran 5 predictors. (i) The first one (in black) is the naive
and non-private Bayesian posterior sampling from the inverse-Wishart distribution. (ii) The second one is
given by Algorithm 4 (in blue). (iii) The third one is given by Algorithm 5 (in red) where the
min-degrees-of-freedom parameter is set to $n + d$ (so that we have a direct way to compare between Algorithms 4 and 5).
(iv) The fourth is given by the JL-based algorithm where the min-number-of-rows parameter is set to $2d$ (in green), and
(v) the fifth one is Algorithm 5 when the min-degrees-of-freedom parameter is set to $2d$. This gives us a direct
comparison between the JL-based algorithm and Algorithm 5. We repeat each experiment $t = 15$ times, measuring the
$\ell_2$-distance between the output estimator of each algorithm and the true $\beta$ we used to generate the data. We report the
mean and standard deviation of the 15 experiments in Figure 3. The results are again consistent among the
various choices of $\epsilon$. Both techniques (ii) and (iii) exhibit fairly large errors throughout,
mainly due to the fact that the parameter $\psi$ used in each algorithm depends on $n$, as opposed to any other
algorithm we present. We were surprised to see how little variance there exists in the results (the variance
is too small to be visible in the figure). We also found it surprising that, for the most part, splitting
the privacy budget in Algorithm 5 turns out to be consistently costlier than Algorithm 4, even for very large
values of $n$. Another result that we found interesting is that technique (v) outperforms the JL-technique (iv)
(and this holds for all values of $n$). Initially we conjectured that the gap can be explained by Lemma A.1,
where the bound for the inverse-JL has a slightly better second order term than the bound for the standard
JL. However, for some values of $n$ the gap is fairly noticeable, and we leave it as an open problem to see if
this holds for any projection matrix (and not just JL).


6 Comparison to the Analyze Gauss Baseline 

In this paper we discuss multiple ways for outputting a differentially private approximation of $A^TA$. One
such way was already given by Dwork et al. in their “Analyze Gauss” paper [DTTZ14]. As mentioned
already, Dwork et al. simply add to $A^TA$ a symmetric matrix $N$ whose entries are sampled i.i.d. from a
suitable Gaussian. Furthermore, the magnitude of the noise introduced by the Analyze Gauss algorithm is
the smallest out of all algorithms. Yet, as we stressed before, the output of Analyze Gauss isn’t necessarily a
positive definite matrix. In this work we investigate the effect of this fact on the problem of linear regression.
We study the utility of the Analyze Gauss algorithm for the linear regression problem both theoretically
(the theorem regarding the utility of Analyze Gauss is deferred to the appendix) and experimentally, in
comparison to the other algorithms we introduce in this work. The high-level message from the experiments
we show here is as follows. In the simple case, Analyze Gauss is the best algorithm to use, and when it
returns “unreasonable” answers, so do all other algorithms we use (details to follow). However, there do
exist cases where it underperforms in comparison to the additive Wishart noise algorithm
and the Wishart or inverse-Wishart sampling algorithms.

In this section we compare between the following 6 techniques.

1. Analyze Gauss algorithm: output $A^TA + N$ with $N$ a symmetric matrix whose entries are i.i.d. samples
from a Gaussian (black line, squares).

2. The JL-based algorithm (blue line, squares).

3. The additive Wishart noise algorithm (magenta line, squares).

4. A scaling version of Analyze Gauss: if the output of Analyze Gauss is not positive definite, add $c I_{d\times d}$ to
it with $c = E[\|N\|]$ (black line, circles).

5. Algorithm 5 which, as we commented in the experiments of Section 5.1, is analogous to the JL-based algorithm and
seems to consistently do better than it. Both algorithms were given the same min-degrees-of-freedom
parameter: $2d$ (blue line, circles).

6. The scaling version of the additive Wishart random noise, as detailed in the earlier experiment on the additive
Wishart noise algorithm. I.e., outputting $A^TA + W - k\cdot V$ (if this leaves the output positive definite) or
$A^TA + W - \left(\sqrt{k} - (\sqrt{d} + \sqrt{2\ln(4/\delta)})\right)^2\cdot V$ otherwise (magenta line, circles).

^In our opinion, this result is of interest by itself.

[Figure 3 panels: epsilon=0.5, epsilon=0.25, epsilon=0.2]

Figure 3: (best seen in color) A comparison of the average $\ell_2$-error for the estimators based on inverse-
Wishart distribution sampling. The non-private sampler is in black, Algorithm 4 is in blue and Algorithm 5
in red. The JL-based algorithm that effectively samples from the Wishart distribution is in
green; and its analogous algorithm that samples from the inverse-Wishart distribution is in
magenta. The x-axis is the size of the data in log-scale.

Post-processing the Analyze Gauss output. We have experimented extensively with multiple ways
to project the output of the Analyze Gauss algorithm onto the manifold of PSD matrices. Indeed, the most
naive approach is to find a PSD matrix $M$ so as to minimize $\|M - (A^TA + N)\|_F$. Such an $M$ effectively turns out to
be the result of zeroing out all negative eigenvalues of $(A^TA + N)$. The utility of this approach turns out to
be just as bad as the standard Analyze Gauss algorithm (with no post-processing), returning estimations of
size 12 or 9 when the true $\beta$ has $\|\beta\| \approx 3$. Other approaches we have experimented with were to try other
values of $c$ for a post-processing of the form $A^TA + N + c I_{d\times d}$ (such as setting $c$ to be the upper- and
lower-bound on the singular values of $N$ w.p. $\ge 1 - \delta$). The performance of such approaches was, overall,
comparable to the chosen technique (setting $c = E[\|N\|]$) but somewhat worse than our choice of $c$.
Therefore, in our experiment, we used the best of all techniques we were able to come up with to post-process
Analyze Gauss. This, however, does not mean that there isn’t another post-processing technique for Analyze
Gauss that we didn’t think of which out-performs our own approach.

6.1 The Basic Single-Regression Experiment

In the same experiment setting from before (see Section 3.1) we compare our 6 estimators based on the
$\ell_2$-distance to the true $\beta$ that generates our observations. The results, given in Figure 4, are pretty conclusive:
Analyze Gauss is the best of all algorithms. Admittedly, for smaller values of $n$ its output is completely out
of scale (while $\|\beta\| \approx 3$, the average error of Analyze Gauss is about 9, 12, and sometimes 30). In fact, the
error of Analyze Gauss for small values of $n$ is so large that we don’t even present it in our graphs (and the
standard deviation is so large that the error bar of Analyze Gauss results in a big spike for such values of $n$).
However, it is important to notice that for such values of $n$ all other techniques also have a fairly large error
(recall, $\|\beta\|$ is roughly 3, so errors $> 2.5$ essentially give no information about $\beta$). Once $n$ reaches a certain
size, there is a sharp transition, and Analyze Gauss becomes the algorithm with the smallest error
for all greater values of $n$. Eventually, the errors of all algorithms become smaller than the error between $\hat\beta$
and $\beta$ ($\hat\beta$ is the non-private estimator of $\beta$). We also comment that, like before, technique #5 (Algorithm 5)
is consistently better than technique #2 (the JL-based algorithm), but also note that both techniques have the largest
variances in comparison to all other techniques.


[Figure 4 panels: epsilon=0.5, epsilon=0.15, epsilon=0.25, epsilon=0.1, epsilon=0.2, epsilon=0.05]

Figure 4: (best seen in color) A comparison of the average $\ell_2$-error for our 6 estimators. Analyze Gauss
(squares) and its scaled version (circles) are in black; JL algorithm (squares) and the JL variant that samples
from the inverse Wishart distribution (circles) are in blue; and the additive Wishart noise (squares) and its
scaled version (circles) are in magenta. The x-axis is the size of the data in log-scale.


6.2 The Multiple-Regressions Experiment 

In this paper we argue that it is important to use algorithms that inherently output a positive definite
matrix. To that end, we now investigate a more complex case, where the data is close to being singular, such
that additive Gaussian noise is likely to introduce much error. The example we focus on is when the data
$A$ is composed of $2p$ features: the first $p = 20$ columns are independent of one another (sampled i.i.d. from
a normal Gaussian); the latter $p = 20$ columns are the result of some linear combination of the first $p$ ones.
And so $A = [X; y_1, \ldots, y_p]$ where for every $i$ we have $y_i = X\beta_i + e_i$ where each coordinate of $e_i$ is sampled
i.i.d. from $\mathcal{N}(0, \sigma^2)$ for $\sigma = 0.5$ (fixed for all $i$). In our experiments, we vary $n$ (in powers of
2), but fix $\epsilon = 0.1$. What we also vary is the number of $y$-features we use in our regression.

Recall, our algorithms approximate the Gram matrix of the data. Once such an approximation is published,
it is possible to run as many linear regressions on it as we want, fixing any one column of the data
as a label and any subset of the remaining columns as the features of the problem. This is precisely what
we analyze here. We look at the linear regression problem where the label is some $y_i$ and the features of
the problem are the first $p$ columns plus some $m$ additional $y$-columns. (I.e.: $\{x_1, \ldots, x_p\} \cup \{y_{i_1}, \ldots, y_{i_m}\}$
where the latter are disjoint from $y_i$.) A good approximation of $\beta$ should therefore return some $\hat\beta$ which is
0 (or roughly 0) on the latter $m$ coordinates. This corresponds to what we believe to be a high-level task
a data-analyst might want to perform: finding out which features are relevant and which are irrelevant for
regression.
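To make the "many regressions from one published matrix" point concrete, the sketch below solves the normal equations using only entries of a Gram matrix $G = A^TA$: for features $F$ and label column $\ell$, it solves $G_{F,F}\,\hat\beta = G_{F,\ell}$. This is a plain least-squares illustration on an exact, non-private Gram matrix; the helper names are hypothetical and the solver assumes $G_{F,F}$ is non-singular.

```python
def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(M[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def regress_from_gram(G, label, features):
    """Least-squares coefficients of column `label` on columns `features`,
    computed only from the Gram matrix G (no access to the raw data)."""
    M = [[G[i][j] for j in features] for i in features]
    b = [G[i][label] for i in features]
    return solve(M, b)


# Toy data: the third column is exactly 2*x1 + 3*x2.
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
A = [[x1, x2, 2 * x1 + 3 * x2] for x1, x2 in rows]
G = [[sum(r[i] * r[j] for r in A) for j in range(3)] for i in range(3)]
beta = regress_from_gram(G, label=2, features=[0, 1])
```

Changing `label` and `features` reuses the same `G`, which is why a single private release of the Gram matrix supports every regression in this experiment.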

The results in this case are far less conclusive and are given in the accompanying figure. When $m = 0$, we are back to
the case of a single regression (with no redundant features), and here Analyze Gauss (black, squares)
outperforms all other algorithms once $n$ is large enough. Yet, it is enough to set $m = 1$
to get very different results. When $m > 0$ it is evident that Analyze Gauss really performs badly: in fact,
in most cases its values were far beyond the range of a reasonable approximation for $\beta$ (taking values like
26 and 45 where $\|\beta\| \approx 3.2$). The scaled version of Analyze Gauss (black, circles) does perform significantly
better, yet it is not the best out of all algorithms. In fact, it is consistently worse than the JL-based
algorithms (blue, circles and squares) and than the scaled version of the additive Wishart noise (magenta,
circles) for $n < 2^{22} = 4{,}194{,}304$. Note that as $m$ increases, all algorithms’ errors become fairly large. In
addition, a separate figure shows the variance of our estimators. It is clear that the scaled version of Analyze Gauss
has the smallest variance. However, the scaled additive Wishart noise algorithm (magenta, circles) seems to
have a good variance as well, and, as discussed, does out-perform the scaled Analyze Gauss algorithm for a
wide range of values of $n$.

Discussion. It is possible to interpret the results of this experiment, especially for the larger values of $m$,
as a detriment for all the algorithms that approximate the Gram matrix of the data. Indeed, we pose the
question of running regression over data where there does exist a large correlation between multiple columns
as an open question. One approach could be to find a differentially private analogue of the techniques
of [Mah11] for choosing a subset of the coordinates that approximate the $k$-PCA. An alternative approach
is to analyze the Lasso regression over the output of the algorithms that approximate the 2nd-moment
matrix. In fact, we did experiment (though not extensively) with the Lasso regression. Using off-the-shelf
Lasso regression packages (R package named glmnet), it seems that all algorithms give estimators that are
indeed sparse, but not specifically over the latter $m$ coordinates. Rather, the estimator is sparse both on
the first $p$ coordinates and on the latter $m$ coordinates. In contrast, running the Lasso regression on the
data without additional randomness (non-privately) gives sparsity over the latter $m$ coordinates. We leave
both problems for future work.

A Useful Lemmas


In this section we detail the main lemmas that we use in our privacy proofs in the following section. The
lemmas and theorems presented here, for the most part, were known prior to our work. We chose to include
them so that the unfamiliar reader can have their full proofs, but we make no claim as to the originality of the
proofs of the lemmas. The proofs of Lemma A.1 and Claim A.2 are based in part on the result of Dasgupta
and Gupta [DG03] and in part on results regarding the Wishart distribution given in [MKB79] (Theorem
3.4.7). We encourage the reader who is familiar with the lemmas and claims in this section to skip their proofs
and turn to Section B where we prove our privacy theorems.

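Lemma A.1. Let $X$ be an $(r \times d)$-matrix of i.i.d. normal Gaussians (i.e., $X_{i,j} \sim \mathcal{N}(0,1)$). Fix $\beta \in (0, \frac{1}{e})$.
Then, for any vector $v$ it holds that

$$\Pr\left[\left|v^T\left(\tfrac{1}{r}X^TX - I\right)v\right| \le \left(2\sqrt{\tfrac{2\ln(2/\beta)}{r}} + \tfrac{2\ln(2/\beta)}{r}\right)\|v\|^2\right] \ge 1 - \beta .$$

Furthermore, if $r > d$ then denote $t = \sqrt{\frac{2\ln(2/\beta)}{r-d+1}}$ and assume $t < 1$. Then

$$\Pr\left[\left|v^T\left(I - \left(\tfrac{1}{r-d+1}X^TX\right)^{-1}\right)v\right| \le \frac{2t - t^2}{(1-t)^2}\,\|v\|^2\right] \ge 1 - \beta . \qquad (1)$$

Proof. Fix $v$. Each entry of $Xv$ is distributed like $\mathcal{N}(0, \|v\|^2)$ and so $v^TX^TXv = \|Xv\|^2$ is just the sum of $r$
squared i.i.d. Gaussians with variance $\|v\|^2$. In other words, $\frac{v^TX^TXv}{\|v\|^2} \sim \chi^2_r$. Concentration bounds (see Claim A.2)
give therefore that w.p. $\ge 1 - \beta$ we have

$$\left(\sqrt{r} - \sqrt{2\ln(2/\beta)}\right)^2 \le \frac{v^TX^TXv}{\|v\|^2} \le \left(\sqrt{r} + \sqrt{2\ln(2/\beta)}\right)^2$$

which implies

$$\left(-2\sqrt{\tfrac{2\ln(2/\beta)}{r}} + \tfrac{2\ln(2/\beta)}{r}\right)\|v\|^2 \le v^T\left(\tfrac{1}{r}X^TX - I\right)v \le \left(2\sqrt{\tfrac{2\ln(2/\beta)}{r}} + \tfrac{2\ln(2/\beta)}{r}\right)\|v\|^2$$

and so we get the bound on $\left|v^T(\tfrac{1}{r}X^TX - I)v\right|$.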
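We now argue that $\frac{v^Tv}{v^T(X^TX)^{-1}v} \sim \chi^2_{r-d+1}$. To see this, we argue that specifically for the vector $e_d$ (the
indicator of the $d$-th coordinate) we have $\frac{1}{e_d^T(X^TX)^{-1}e_d} \sim \chi^2_{r-d+1}$; the result for any $v$ follows from
taking any unitary matrix $U$ s.t. $U^Tv = \|v\|e_d$, and the observation that the distributions of $X$ and $XU^T$
are identical.

Now, clearly $e_d^T(X^TX)^{-1}e_d = (X^TX)^{-1}_{d,d}$. Now, if we denote the last column of $X$ as $x_d$ and the first
$d-1$ columns of $X$ as $X_{-d}$ then

$$X^TX = \begin{pmatrix} X_{-d}^TX_{-d} & X_{-d}^Tx_d \\ x_d^TX_{-d} & x_d^Tx_d \end{pmatrix}.$$

Thus, the formula for the entries of the inverse gives

$$\frac{1}{(X^TX)^{-1}_{d,d}} = x_d^Tx_d - x_d^TX_{-d}(X_{-d}^TX_{-d})^{-1}X_{-d}^Tx_d = x_d^T\left(I - X_{-d}(X_{-d}^TX_{-d})^{-1}X_{-d}^T\right)x_d =: x_d^T P x_d .$$

Now, w.p. 1 we have that $X_{-d}$ has full rank $(d-1)$. For any choice of $X_{-d}$ with full rank we get a matrix $P$
which has rank $r - (d-1)$ and its eigenvalues are either 1 or 0. Hence, for any $X_{-d}$ we get $\frac{1}{(X^TX)^{-1}_{d,d}} \sim \chi^2_{r-d+1}$.
Since this distribution is independent of $X_{-d}$ we therefore have that this result holds w.p. 1. I.e., writing
$P(X_{-d}) = I - X_{-d}(X_{-d}^TX_{-d})^{-1}X_{-d}^T$:

$$\begin{aligned}
\mathrm{PDF}_{1/(X^TX)^{-1}_{d,d}}(z) &= \int_P \mathrm{PDF}\left(\tfrac{1}{(X^TX)^{-1}_{d,d}} = z \,\middle|\, P(X_{-d}) = P\right)\cdot \mathrm{PDF}\left(P(X_{-d}) = P\right)dP \\
&= \int_P \mathrm{PDF}_{\chi^2_{r-d+1}}(z)\cdot \mathrm{PDF}\left(P(X_{-d}) = P\right)dP \\
&= \mathrm{PDF}_{\chi^2_{r-d+1}}(z)\cdot\int_P \mathrm{PDF}\left(P(X_{-d}) = P\right)dP = \mathrm{PDF}_{\chi^2_{r-d+1}}(z).
\end{aligned}$$

Therefore, with probability $\ge 1 - \beta$ we have

$$\frac{v^Tv}{v^T(X^TX)^{-1}v} \in \left(\left(\sqrt{r-d+1} - \sqrt{2\ln(2/\beta)}\right)^2,\ \left(\sqrt{r-d+1} + \sqrt{2\ln(2/\beta)}\right)^2\right)$$

so

$$\left(\frac{\sqrt{r-d+1}}{\sqrt{r-d+1} + \sqrt{2\ln(2/\beta)}}\right)^2 \le \frac{v^T\left(\tfrac{1}{r-d+1}X^TX\right)^{-1}v}{\|v\|^2} \le \left(\frac{\sqrt{r-d+1}}{\sqrt{r-d+1} - \sqrt{2\ln(2/\beta)}}\right)^2$$

which implies, with $t = \sqrt{\frac{2\ln(2/\beta)}{r-d+1}}$,

$$-\frac{2t - t^2}{(1-t)^2}\,\|v\|^2 \le v^T\left(I - \left(\tfrac{1}{r-d+1}X^TX\right)^{-1}\right)v \le \frac{2t + t^2}{(1+t)^2}\,\|v\|^2 .$$

Some arithmetic manipulations show that when $t < 1$ we have that

$$\frac{2t + t^2}{(1+t)^2} \le \frac{2t - t^2}{(1-t)^2},$$

and so the bound in (1) uses the right-hand side, as this is the larger term of the two. □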
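To get a feel for the two deviation bounds of Lemma A.1, the sketch below evaluates them numerically: the "standard JL" bound $2\sqrt{2\ln(2/\beta)/r} + 2\ln(2/\beta)/r$ and the "inverse" bound $(2t - t^2)/(1-t)^2$ with $t = \sqrt{2\ln(2/\beta)/(r-d+1)}$. The function name is hypothetical, and the $r - d + 1$ normalization is the reading assumed here from the $\chi^2_{r-d+1}$ argument in the proof.

```python
import math


def lemma_a1_bounds(r, d, beta):
    """Return (first_bound, second_bound), the multipliers of ||v||^2 in Lemma A.1."""
    L = 2.0 * math.log(2.0 / beta)
    first = 2.0 * math.sqrt(L / r) + L / r
    t = math.sqrt(L / (r - d + 1))
    if t >= 1.0:
        raise ValueError("Lemma A.1 assumes t < 1; increase r")
    second = (2.0 * t - t * t) / (1.0 - t) ** 2
    return first, second


first, second = lemma_a1_bounds(r=1000, d=10, beta=0.05)
```

Both multipliers shrink at a rate of roughly $1/\sqrt{r}$, so the number of projected rows directly controls the relative distortion.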

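Claim A.2. Fix $k$ and let $X_1, \ldots, X_k$ be i.i.d. samples from $\mathcal{N}(0,1)$. Then, for any $0 < \Delta < k$ we have that
$\Pr\left[\sum_i X_i^2 > (\sqrt{k} + \sqrt{\Delta})^2\right] \le e^{-\Delta/2}$ and $\Pr\left[\sum_i X_i^2 < (\sqrt{k} - \sqrt{\Delta})^2\right] \le e^{-\Delta/2}$.

Proof. We start with the following calculation. For any $X \sim \mathcal{N}(0,1)$ and any $s < 1/2$ it holds that

$$E\left[e^{sX^2}\right] = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2(1-2s)}{2}}\,dx = \frac{1}{\sqrt{1-2s}}\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{y^2}{2}}\,dy = \frac{1}{\sqrt{1-2s}},$$

using the substitution $y = x\sqrt{1-2s}$, so $dy = dx\sqrt{1-2s}$.

We now use Markov’s inequality, to deduce that for any $\lambda \in (0, 1/2)$

$$\begin{aligned}
\Pr\left[\sum_i X_i^2 > (\sqrt{k}+\sqrt{\Delta})^2\right] &= \Pr\left[e^{\lambda\sum_i X_i^2} > e^{\lambda(\sqrt{k}+\sqrt{\Delta})^2}\right] \le \prod_i E\left[e^{\lambda X_i^2}\right] e^{-\lambda(\sqrt{k}+\sqrt{\Delta})^2} \\
&= \left(\frac{1}{1-2\lambda}\right)^{k/2} e^{-\lambda(\sqrt{k}+\sqrt{\Delta})^2} = \left(1 + \frac{2\lambda}{1-2\lambda}\right)^{k/2} e^{-\lambda(\sqrt{k}+\sqrt{\Delta})^2} \\
&\le \exp\left(\frac{\lambda k}{1-2\lambda} - \lambda(\sqrt{k}+\sqrt{\Delta})^2\right).
\end{aligned}$$

Setting $\lambda = \frac{\sqrt{\Delta}}{2(\sqrt{k}+\sqrt{\Delta})}$, so that $1 - 2\lambda = \frac{\sqrt{k}}{\sqrt{k}+\sqrt{\Delta}}$, we have

$$\Pr\left[\sum_i X_i^2 > (\sqrt{k}+\sqrt{\Delta})^2\right] \le \exp\left(\lambda\sqrt{k}(\sqrt{k}+\sqrt{\Delta}) - \lambda(\sqrt{k}+\sqrt{\Delta})^2\right) = \exp\left(-\lambda\sqrt{\Delta}(\sqrt{k}+\sqrt{\Delta})\right) = \exp\left(-\tfrac{\Delta}{2}\right).$$

A similar calculation shows the lower bound. For any $\lambda > 0$,

$$\begin{aligned}
\Pr\left[\sum_i X_i^2 < (\sqrt{k}-\sqrt{\Delta})^2\right] &= \Pr\left[e^{-\lambda\sum_i X_i^2} > e^{-\lambda(\sqrt{k}-\sqrt{\Delta})^2}\right] \le \prod_i E\left[e^{-\lambda X_i^2}\right] e^{\lambda(\sqrt{k}-\sqrt{\Delta})^2} \\
&= \left(\frac{1}{1+2\lambda}\right)^{k/2} e^{\lambda(\sqrt{k}-\sqrt{\Delta})^2} = \left(1 - \frac{2\lambda}{1+2\lambda}\right)^{k/2} e^{\lambda(\sqrt{k}-\sqrt{\Delta})^2} \\
&\le \exp\left(-\frac{\lambda k}{1+2\lambda} + \lambda(\sqrt{k}-\sqrt{\Delta})^2\right).
\end{aligned}$$

Setting $\lambda = \frac{\sqrt{\Delta}}{2(\sqrt{k}-\sqrt{\Delta})}$, so that $1 + 2\lambda = \frac{\sqrt{k}}{\sqrt{k}-\sqrt{\Delta}}$, we have

$$\Pr\left[\sum_i X_i^2 < (\sqrt{k}-\sqrt{\Delta})^2\right] \le \exp\left(-\lambda\sqrt{k}(\sqrt{k}-\sqrt{\Delta}) + \lambda(\sqrt{k}-\sqrt{\Delta})^2\right) = \exp\left(-\lambda\sqrt{\Delta}(\sqrt{k}-\sqrt{\Delta})\right) = \exp\left(-\tfrac{\Delta}{2}\right).$$

□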
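The tail bounds of Claim A.2 are easy to sanity-check by simulation. The sketch below draws sums of $k$ squared standard Gaussians and compares the empirical frequency of each tail event with $e^{-\Delta/2}$. It is only a Monte Carlo illustration with an arbitrary seed and sample size, not part of the proof; the function name is hypothetical.

```python
import math
import random


def chi_square_tail_frequencies(k, delta_cap, trials, seed=0):
    """Empirical frequencies of the two tail events in Claim A.2.

    Returns (freq_above, freq_below, bound), where bound = exp(-delta_cap / 2)
    and delta_cap plays the role of the claim's capital Delta (0 < delta_cap < k).
    """
    rng = random.Random(seed)
    upper = (math.sqrt(k) + math.sqrt(delta_cap)) ** 2
    lower = (math.sqrt(k) - math.sqrt(delta_cap)) ** 2
    above = below = 0
    for _ in range(trials):
        total = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))
        above += total > upper
        below += total < lower
    return above / trials, below / trials, math.exp(-delta_cap / 2.0)


freq_above, freq_below, bound = chi_square_tail_frequencies(k=10, delta_cap=4.0, trials=20000)
```

In runs like this the empirical frequencies sit well below the bound, which reflects that the Chernoff-style argument is not tight for small $k$.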


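Lemma A.3. Fix $\delta \in (0, \frac{1}{e})$. Let $X$ be a matrix sampled from a Wishart distribution $\mathcal{W}_d(V, m)$ where
$\sqrt{m} > \left(\sqrt{d} + \sqrt{2\ln(\tfrac{2}{\delta})}\right)$. Then, w.p. $\ge 1 - \delta$ we have that for every $j = 1, 2, \ldots, d$ it holds that

$$\sigma_j(X) \in \left(\sqrt{m} \pm \left(\sqrt{d} + \sqrt{2\ln(\tfrac{2}{\delta})}\right)\right)^2 \sigma_j(V).$$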
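Furthermore, we also have that for any $0 < a < m$ it holds

$$\|aV - X\| \le \|V\|\cdot\left|a - \left(\sqrt{m} - \left(\sqrt{d} + \sqrt{2\ln(\tfrac{2}{\delta})}\right)\right)^2\right| \quad\text{and}$$

$$\|(aV)^{-1} - X^{-1}\| \le \sigma_{\min}^{-1}(V)\cdot\left|a^{-1} - \left(\sqrt{m} + \left(\sqrt{d} + \sqrt{2\ln(\tfrac{2}{\delta})}\right)\right)^{-2}\right| .$$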

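Proof. In order to sample $X \sim \mathcal{W}_d(V, m)$ we first sample a matrix $Y \in \mathbb{R}^{m\times d}$ in which every entry is an i.i.d.
normal Gaussian. We then multiply $Y$ by $V^{1/2}$ s.t. every row in $YV^{1/2}$ is sampled i.i.d. from $\mathcal{N}(0_d, V)$. We
then set $X = V^{1/2}Y^TYV^{1/2}$.

Now, we invoke a theorem of Davidson and Szarek [DS01] (Theorem II.13) that states that for any $t \ge 1$
we have

$$\Pr\left[\sigma_{\max}(Y) > \sqrt{m} + \sqrt{d} + t\right] < e^{-t^2/2} \quad\text{and}\quad \Pr\left[\sigma_{\min}(Y) < \sqrt{m} - \sqrt{d} - t\right] < e^{-t^2/2}$$

to deduce that w.p. $\ge 1 - \delta$ it holds that all of the singular values of $Y$ lie on the interval
$\left(\sqrt{m} - \left(\sqrt{d} + \sqrt{2\ln(\tfrac{2}{\delta})}\right),\ \sqrt{m} + \left(\sqrt{d} + \sqrt{2\ln(\tfrac{2}{\delta})}\right)\right)$.
Next, we let $u_j$ denote the $j$-th eigenvector of $V$, corresponding to the $j$-th eigenvalue $\sigma_j(V)$. Therefore, for
any $j$ we have

$$\begin{aligned}
u_j^TXu_j &= (V^{1/2}u_j)^T Y^TY (V^{1/2}u_j) \\
&\le \left(\sqrt{m} + \left(\sqrt{d} + \sqrt{2\ln(\tfrac{2}{\delta})}\right)\right)^2 \|V^{1/2}u_j\|^2 = \sigma_j(V)\left(\sqrt{m} + \left(\sqrt{d} + \sqrt{2\ln(\tfrac{2}{\delta})}\right)\right)^2 \\
u_j^TXu_j &\ge \left(\sqrt{m} - \left(\sqrt{d} + \sqrt{2\ln(\tfrac{2}{\delta})}\right)\right)^2 \|V^{1/2}u_j\|^2 = \sigma_j(V)\left(\sqrt{m} - \left(\sqrt{d} + \sqrt{2\ln(\tfrac{2}{\delta})}\right)\right)^2
\end{aligned}$$

and furthermore, for any subspace $S$ we have that

$$\max_{u \in S:\ \|u\|=1} u^TXu \le \left(\sqrt{m} + \left(\sqrt{d} + \sqrt{2\ln(\tfrac{2}{\delta})}\right)\right)^2 \left(\max_{u\in S:\ \|u\|=1} \|V^{1/2}u\|^2\right)$$

$$\min_{u \in S:\ \|u\|=1} u^TXu \ge \left(\sqrt{m} - \left(\sqrt{d} + \sqrt{2\ln(\tfrac{2}{\delta})}\right)\right)^2 \left(\min_{u\in S:\ \|u\|=1} \|V^{1/2}u\|^2\right)$$

Thus, to complete the first part of the proof, we invoke the Courant-Fischer Min-Max Theorem that
states that

$$\sigma_j(X) = \max_{\{S \subset \mathbb{R}^d:\ \dim(S) = j\}}\ \min_{\{u \in S:\ \|u\|=1\}} u^TXu = \min_{\{S \subset \mathbb{R}^d:\ \dim(S) = d-j+1\}}\ \max_{\{u\in S:\ \|u\|=1\}} u^TXu .$$

Therefore, we can pick $S' = \mathrm{span}\{u_1, \ldots, u_j\}$ and $S'' = \mathrm{span}\{u_j, \ldots, u_d\}$ to deduce

$$\sigma_j(X) \ge \min_{u\in S':\ \|u\|=1} u^TXu \ge \left(\sqrt{m} - \left(\sqrt{d} + \sqrt{2\ln(\tfrac{2}{\delta})}\right)\right)^2 \sigma_j(V)$$

$$\sigma_j(X) \le \max_{u\in S'':\ \|u\|=1} u^TXu \le \left(\sqrt{m} + \left(\sqrt{d} + \sqrt{2\ln(\tfrac{2}{\delta})}\right)\right)^2 \sigma_j(V)$$

As for the second part of the claim, it follows from the fact that $aV - X = V^{1/2}(aI - Y^TY)V^{1/2}$. Now, if we
denote $Y^TY = U\Sigma U^T$ as the eigendecomposition of $Y^TY$, we have $aI - Y^TY = U(aI - \Sigma)U^T$, and all the entries
on the diagonal of $aI - \Sigma$ lie in the range $\left|a - \left(\sqrt{m} \pm \left(\sqrt{d} + \sqrt{2\ln(\tfrac{2}{\delta})}\right)\right)^2\right|$. As $a < m$ we have that all eigenvalues are
upper bounded by $(m - a) + 2\sqrt{m}\left(\sqrt{d} + \sqrt{2\ln(\tfrac{2}{\delta})}\right)$ and the claim follows. Similarly, for
$(aV)^{-1} - X^{-1} = V^{-1/2}\left(a^{-1}I - (Y^TY)^{-1}\right)V^{-1/2}$ all eigenvalues lie in the range
$\left|a^{-1} - \left(\sqrt{m} \pm \left(\sqrt{d} + \sqrt{2\ln(\tfrac{2}{\delta})}\right)\right)^{-2}\right|$, which in this
case is upper bounded by $\left|a^{-1} - \left(\sqrt{m} + \left(\sqrt{d} + \sqrt{2\ln(\tfrac{2}{\delta})}\right)\right)^{-2}\right|$. We comment that the bounds on $\|aV - X\|$
and on $\|(aV)^{-1} - X^{-1}\|$ require we use both the upper- and lower-bounds on the eigenvalues of $Y$.

□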
The other two useful tools we use are the formula for rank-1 updates of the determinant and the inverse
(the Sherman-Morrison lemma).

Theorem A.4. Let $A$ be a $(d\times d)$-invertible matrix and fix any two $d$-dimensional vectors $u, v$ s.t.
$v^TA^{-1}u \ne -1$. Then:

$$\det(A + uv^T) = \det(A)\left(1 + v^TA^{-1}u\right)$$

$$(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1}uv^TA^{-1}}{1 + v^TA^{-1}u}$$

Proof. Since we have $A + uv^T = A(I + A^{-1}uv^T)$, we analyze the spectrum of the matrix $I + A^{-1}uv^T$.
Clearly, for any $x \perp v$ we have $(I + A^{-1}uv^T)x = x + 0\cdot A^{-1}u = x$, so $d-1$ of the eigenvalues of $I + A^{-1}uv^T$
are exactly 1. As for the last one, take the unit length vector $z = \frac{v}{\|v\|}$ and we have $z^T(I + A^{-1}uv^T)z =
1 + \|v\|\cdot z^TA^{-1}u = 1 + v^TA^{-1}u$. Therefore, $\det(A + uv^T) = \det(A)\det(I + A^{-1}uv^T) = \det(A)(1 + v^TA^{-1}u)$.

As for the Sherman-Morrison formula, we can simply check and see that indeed:

$$\begin{aligned}
(A + uv^T)\left(A^{-1} - \frac{A^{-1}uv^TA^{-1}}{1 + v^TA^{-1}u}\right) &= I + uv^TA^{-1} - \frac{uv^TA^{-1} + uv^TA^{-1}uv^TA^{-1}}{1 + v^TA^{-1}u} \\
&= I + uv^TA^{-1}\left(1 - \frac{1}{1 + v^TA^{-1}u} - \frac{v^TA^{-1}u}{1 + v^TA^{-1}u}\right) = I
\end{aligned}$$

□
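Both rank-1 update identities of Theorem A.4 can be checked numerically on a small instance. The sketch below does so for a $2\times 2$ matrix using hand-written helpers (all names are illustrative); it verifies one example and is of course not a proof.

```python
def det2(M):
    """Determinant of a 2x2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]


def inv2(M):
    """Inverse of an invertible 2x2 matrix."""
    det = det2(M)
    return [[M[1][1] / det, -M[0][1] / det], [-M[1][0] / det, M[0][0] / det]]


def rank_one_update_checks(A, u, v):
    """Return both sides of the determinant lemma and of Sherman-Morrison."""
    updated = [[A[i][j] + u[i] * v[j] for j in range(2)] for i in range(2)]
    A_inv = inv2(A)
    A_inv_u = [sum(A_inv[i][k] * u[k] for k in range(2)) for i in range(2)]
    vT_A_inv = [sum(v[k] * A_inv[k][j] for k in range(2)) for j in range(2)]
    denom = 1.0 + sum(v[i] * A_inv_u[i] for i in range(2))  # 1 + v^T A^{-1} u
    det_lhs = det2(updated)
    det_rhs = det2(A) * denom
    inv_lhs = inv2(updated)
    inv_rhs = [[A_inv[i][j] - A_inv_u[i] * vT_A_inv[j] / denom for j in range(2)]
               for i in range(2)]
    return det_lhs, det_rhs, inv_lhs, inv_rhs


A = [[2.0, 1.0], [1.0, 3.0]]
u = [1.0, -1.0]
v = [0.5, 2.0]
det_lhs, det_rhs, inv_lhs, inv_rhs = rank_one_update_checks(A, u, v)
```

In the privacy proofs these identities are applied with the rank-1 matrix being the contribution $vv^T$ of the single differing row.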


B Privacy Theorems 

In this section, we provide the formal proofs that our algorithms are differentially private. We comment that,
because we hope these algorithms will be implemented, we took the time to analyze the exact constants in
our proofs rather than settling for $O(\cdot)$-notation. In addition to the three algorithms we provide, we give
another theorem about the privacy of an algorithm that adds Gaussian noise to the inverse of the data,
which may be of independent interest.


B.1 Privacy Proof for the JL-Based Algorithm

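Theorem B.1. Fix $\epsilon > 0$ and $\delta \in (0, \frac{1}{e})$. Fix $B > 0$. Fix a positive integer $r$ and let $w$ be such that

$$w^2 = B^2\left(1 + \frac{1 + \frac{\epsilon}{\ln(4/\delta)}}{\epsilon}\left(2\sqrt{2r\ln(4/\delta)} + 2\ln(4/\delta)\right)\right).$$

Let $A$ be an $(n \times d)$-matrix with $d < r$ and where each row of $A$ has $\ell_2$-norm bounded by $B$. Given that
$\sigma_{\min}(A) \ge w$, the algorithm that picks an $(r \times n)$-matrix $R$ whose entries are i.i.d. samples from a normal
distribution $\mathcal{N}(0,1)$ and publishes $R\cdot A$ is $(\epsilon,\delta)$-differentially private.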

Corollary B.2. assuming e < 1 and 5 < e~^, if it holds that r > 21n(j) then it suffices to have > 


8B^ V’’ W4/<s) results of Theorem 

is publicly known to w, we can set 


B.l 


to hold. Alternatively, given input where its least singular value 


r = 


ew 


8B2 1n(|) 


if indeed r > 2 ln(|) 


and satisfy {e,S)-differential privacy. Therefore, if the rows of A are i.i.d draws from a 0-mean multivariate 


Gaussian with variance S, then we may set r as 


€Crniin(S) 

8B"ln(|) 


= n(n^). 


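Proof. Fix $A$ and $A'$ to be two neighboring $(n \times d)$ matrices, s.t. $A - A'$ is a rank-1 matrix of the form
$E = A - A' = e_i(v - v')^T$. We denote by $M$ the matrix $A$ with the $i$-th row zeroed out, and have
$M^TM = A^TA - vv^T = A'^TA' - v'v'^T$. Recall that we assume that $\sigma_{\min}(A), \sigma_{\min}(A') \ge w$ and $\|E\| = \|v - v'\| \le 2B$.
We transpose $A$ and $R$ and denote $X = A^TR^T$ and $X' = A'^TR^T$. For each column $y_j$ of $R^T$ it holds
that $y_j \sim \mathcal{N}(0_n, I_{n\times n})$, and therefore the $j$-th column of $X$ is distributed like a random variable from
$\mathcal{N}(0_d, A^TA)$. Furthermore, as the columns of $R^T$ are independently chosen, so are the columns of $X$
independent of one another. Therefore, for any $r$ vectors $x_1, \ldots, x_r$ it holds that

$$\mathrm{PDF}_X(x_1, \ldots, x_r) = \prod_{j=1}^r \frac{1}{\sqrt{(2\pi)^d\det(A^TA)}}\exp\left(-\tfrac12 x_j^T(A^TA)^{-1}x_j\right)$$

$$\mathrm{PDF}_{X'}(x_1, \ldots, x_r) = \prod_{j=1}^r \frac{1}{\sqrt{(2\pi)^d\det(A'^TA')}}\exp\left(-\tfrac12 x_j^T(A'^TA')^{-1}x_j\right)$$

We apply the Matrix Determinant Lemma, and the Sherman-Morrison Lemma, and deduce:

$$\det(A^TA) = \det(M^TM)\left(1 + v^T(M^TM)^{-1}v\right), \qquad \det(A'^TA') = \det(M^TM)\left(1 + v'^T(M^TM)^{-1}v'\right)$$

$$(A^TA)^{-1} = (M^TM)^{-1} - \frac{(M^TM)^{-1}vv^T(M^TM)^{-1}}{1 + v^T(M^TM)^{-1}v}, \qquad (A'^TA')^{-1} = (M^TM)^{-1} - \frac{(M^TM)^{-1}v'v'^T(M^TM)^{-1}}{1 + v'^T(M^TM)^{-1}v'}$$

Together with the inequality $\frac{1+x}{1+y} = (1+x)\left(1 - \frac{y}{1+y}\right) \le \exp\left(x - \frac{y}{1+y}\right)$ for any $x, y > -1$ we have

$$\begin{aligned}
\frac{\mathrm{PDF}_X(x_1,\ldots,x_r)}{\mathrm{PDF}_{X'}(x_1,\ldots,x_r)} &= \prod_{j=1}^r \left(\frac{1 + v'^T(M^TM)^{-1}v'}{1 + v^T(M^TM)^{-1}v}\right)^{1/2}\exp\left(-\tfrac12 x_j^T\left((A^TA)^{-1} - (A'^TA')^{-1}\right)x_j\right) \\
&\le \prod_{j=1}^r \exp\left(\tfrac12\left(v'^T(M^TM)^{-1}v' - \frac{x_j^T(M^TM)^{-1}v'v'^T(M^TM)^{-1}x_j}{1 + v'^T(M^TM)^{-1}v'}\right)\right) \\
&\qquad\cdot \exp\left(-\tfrac12\left(\frac{v^T(M^TM)^{-1}v}{1 + v^T(M^TM)^{-1}v} - \frac{x_j^T(M^TM)^{-1}vv^T(M^TM)^{-1}x_j}{1 + v^T(M^TM)^{-1}v}\right)\right) \\
&= \exp\left(\tfrac12\left(r\cdot v'^T(M^TM)^{-1}v' - \frac{v'^T(M^TM)^{-1}\left(\sum_j x_jx_j^T\right)(M^TM)^{-1}v'}{1 + v'^T(M^TM)^{-1}v'}\right)\right) \\
&\qquad\cdot \exp\left(-\tfrac12\left(\frac{r\cdot v^T(M^TM)^{-1}v - v^T(M^TM)^{-1}\left(\sum_j x_jx_j^T\right)(M^TM)^{-1}v}{1 + v^T(M^TM)^{-1}v}\right)\right) \qquad (2)
\end{aligned}$$

Denote

$$z_1 = v'^T(M^TM)^{-1}v' - v'^T(M^TM)^{-1}\left(\tfrac1r\sum_{j=1}^r x_jx_j^T\right)(M^TM)^{-1}v'$$

$$z_2 = v^T(M^TM)^{-1}v - v^T(M^TM)^{-1}\left(\tfrac1r\sum_{j=1}^r x_jx_j^T\right)(M^TM)^{-1}v$$

we have that

$$\ln\left(\frac{\mathrm{PDF}_X(x_1,\ldots,x_r)}{\mathrm{PDF}_{X'}(x_1,\ldots,x_r)}\right) \le \frac r2\left(\frac{z_1}{1 + v'^T(M^TM)^{-1}v'} - \frac{z_2}{1 + v^T(M^TM)^{-1}v} + \frac{\left(v'^T(M^TM)^{-1}v'\right)^2}{1 + v'^T(M^TM)^{-1}v'}\right) \le \frac r2\left(|z_1| + |z_2| + \left(v'^T(M^TM)^{-1}v'\right)^2\right)$$

We now turn to analyze each of the above three terms separately. The easiest to bound are the terms
$v^T(M^TM)^{-1}v$ and $v'^T(M^TM)^{-1}v'$. Weyl’s inequality yields that $\sigma_{\min}(M^TM) \ge \sigma_{\min}(A^TA) - B^2 \ge w^2 - B^2$ and so we
give both terms the bound $\frac{B^2}{w^2 - B^2} = \left(\frac{w^2}{B^2} - 1\right)^{-1}$. We turn to bounding $|z_1|, |z_2|$.

We continue assuming that $x_1, \ldots, x_r$ were sampled from $\mathcal{N}(0_d, A^TA)$. If they were sampled from $\mathcal{N}(0_d, A'^TA')$ then
the proof is analogous. Denote $X$ as the $(r \times d)$-matrix whose rows are $x_1, \ldots, x_r$. We have

$$\begin{aligned}
z_2 &= \left((M^TM)^{-1}v\right)^T\left(M^TM - \tfrac1r X^TX\right)(M^TM)^{-1}v \\
&= \left((M^TM)^{-1}v\right)^T\left(A^TA - vv^T - \tfrac1r X^TX\right)(M^TM)^{-1}v \\
&= \left((M^TM)^{-1}v\right)^T(A^TA)^{1/2}\left(I - (A^TA)^{-1/2}\left(\tfrac1r X^TX\right)(A^TA)^{-1/2}\right)(A^TA)^{1/2}(M^TM)^{-1}v - \left(v^T(M^TM)^{-1}v\right)^2
\end{aligned}$$

Recall that $X$ is a matrix whose rows are i.i.d. samples from the multivariate Gaussian $\mathcal{N}(0, A^TA)$.
Therefore, the rows of the matrix $X(A^TA)^{-1/2}$ are i.i.d. samples from $\mathcal{N}(0, I_{d\times d})$. In other words, the
distribution of $X(A^TA)^{-1/2}$ is the same as a matrix whose entries are i.i.d. samples from $\mathcal{N}(0,1)$. We can
therefore invoke Lemma A.1 and have that w.p. $\ge 1 - \delta/2$

$$\begin{aligned}
|z_2| &\le \left(2\sqrt{\tfrac{2\ln(4/\delta)}{r}} + \tfrac{2\ln(4/\delta)}{r}\right)\left\|(A^TA)^{1/2}(M^TM)^{-1}v\right\|^2 + \left(v^T(M^TM)^{-1}v\right)^2 \\
&\le \left(\tfrac{w^2}{B^2} - 1\right)^{-1}\left(2\sqrt{\tfrac{2\ln(4/\delta)}{r}} + \tfrac{2\ln(4/\delta)}{r}\right) + \left(\tfrac{w^2}{B^2} - 1\right)^{-2}\left(2\sqrt{\tfrac{2\ln(4/\delta)}{r}} + \tfrac{2\ln(4/\delta)}{r} + 1\right)
\end{aligned}$$

where the last step uses $\|(A^TA)^{1/2}(M^TM)^{-1}v\|^2 = v^T(M^TM)^{-1}v + \left(v^T(M^TM)^{-1}v\right)^2$.
As the bound on $|z_1|$ is the same as the bound on $|z_2|$ we conclude that

$$\begin{aligned}
\ln\left(\frac{\mathrm{PDF}_X(x_1,\ldots,x_r)}{\mathrm{PDF}_{X'}(x_1,\ldots,x_r)}\right) &\le \frac r2\left(|z_1| + |z_2| + \left(v'^T(M^TM)^{-1}v'\right)^2\right) \\
&\le \left(\tfrac{w^2}{B^2} - 1\right)^{-1}\left(2\sqrt{2r\ln(4/\delta)} + 2\ln(4/\delta)\right) + \left(\tfrac{w^2}{B^2} - 1\right)^{-2}\left(2\sqrt{2r\ln(4/\delta)} + 2\ln(4/\delta) + \tfrac{3r}{2}\right) \\
&= \frac{\epsilon}{1 + \frac{\epsilon}{\ln(4/\delta)}}\left(1 + \frac{\epsilon}{1 + \frac{\epsilon}{\ln(4/\delta)}}\cdot\frac{2\sqrt{2r\ln(4/\delta)} + 2\ln(4/\delta) + \frac{3r}{2}}{\left(2\sqrt{2r\ln(4/\delta)} + 2\ln(4/\delta)\right)^2}\right) \\
&\le \frac{\epsilon}{1 + \frac{\epsilon}{\ln(4/\delta)}}\left(1 + \frac{\epsilon}{\ln(4/\delta)}\left(\frac12 + \frac{3}{16}\right)\right) \le \epsilon
\end{aligned}$$

by plugging in the value of $w^2$.

□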
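The final chain of inequalities in this proof can be checked numerically. The sketch below assumes the reading $w^2 = B^2\left(1 + \frac{1 + \epsilon/\ln(4/\delta)}{\epsilon}\left(2\sqrt{2r\ln(4/\delta)} + 2\ln(4/\delta)\right)\right)$ for the value in Theorem B.1 (the formula is garbled in the extracted text, so this is an assumption) and evaluates the resulting bound on the log-ratio of densities, $T/q + (T + \frac{3r}{2})/q^2$ with $T = 2\sqrt{2r\ln(4/\delta)} + 2\ln(4/\delta)$ and $q = w^2/B^2 - 1$. Function names are hypothetical.

```python
import math


def theorem_b1_w_squared(B, eps, delta, r):
    """Assumed reading of the w^2 required by Theorem B.1."""
    L = math.log(4.0 / delta)
    T = 2.0 * math.sqrt(2.0 * r * L) + 2.0 * L
    return B ** 2 * (1.0 + (1.0 + eps / L) / eps * T)


def privacy_loss_bound(B, eps, delta, r):
    """Upper bound on the log density ratio obtained at the end of the proof."""
    L = math.log(4.0 / delta)
    T = 2.0 * math.sqrt(2.0 * r * L) + 2.0 * L
    q = theorem_b1_w_squared(B, eps, delta, r) / B ** 2 - 1.0
    return T / q + (T + 1.5 * r) / q ** 2


loss = privacy_loss_bound(B=1.0, eps=0.1, delta=1e-6, r=100)
```

Under this reading the computed bound stays below $\epsilon$, with most of the budget consumed by the first-order term $T/q$.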


B.2 Privacy Proof for the Additive Wishart Noise Algorithm

Theorem B.3. Fix $\epsilon \in (0,1)$ and $\delta \in (0,1)$. Fix $B > 0$. Let $C_1$ and $C_2$ be such that they satisfy

$$\frac{2\sqrt{C_2}}{C_1\left(\sqrt{C_2} - 1\right)^2} \le \frac{\epsilon}{B^2} .$$

Let $A$ be an $(n \times d)$-matrix where each row of $A$ has $\ell_2$-norm bounded by $B$. Let
$N$ be a matrix sampled from the $d$-dimensional Wishart distribution with $\nu$ degrees of freedom using the scale
matrix $V$ (i.e., $N \sim \mathcal{W}_d(V, \nu)$) for any matrix $V$ with least singular value $\sigma_{\min}(V) \ge C_1$ (e.g. $V = C_1 I_{d\times d}$)
and $\nu \ge \lfloor d + 2C_2\ln(4/\delta)\rfloor$. Then outputting $X = A^TA + N$ is $(\epsilon,\delta)$-differentially private.

We comment that in order to sample such an N, one can sample a matrix N' G of i.i.d normal 

Gaussians, multiple all entries by Bj^ and set N' = N~^N. 
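The sampling recipe in the comment above is the standard construction of a Wishart matrix as $N = N'^TN'$ from a Gaussian matrix $N'$. A minimal sketch follows for the special case of a scaled-identity scale matrix $V = c\cdot I_{d\times d}$, where each Gaussian entry is multiplied by $\sqrt{c}$. The exact scaling constant used in the paper is illegible in the extracted text, so `c` is a free parameter here and the function name is hypothetical.

```python
import math
import random


def sample_wishart_scaled_identity(d, nu, c, seed=None):
    """Sample N ~ W_d(c * I, nu) as N'^T N', where N' is a (nu x d) matrix
    whose entries are i.i.d. N(0, c) (standard Gaussians times sqrt(c))."""
    rng = random.Random(seed)
    scale = math.sqrt(c)
    n_prime = [[scale * rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(nu)]
    return [[sum(row[i] * row[j] for row in n_prime) for j in range(d)]
            for i in range(d)]


noise = sample_wishart_scaled_identity(d=3, nu=10, c=2.0, seed=1)
```

Since `noise` is a Gram matrix it is positive semidefinite by construction, which is what lets $A^TA + N$ remain a valid 2nd-moment matrix.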

Proof. Fix $A$ and $A'$ that are two neighboring datasets that differ on the $i$-th row, denoted as $v^T$ in $A$
and $v'^T$ in $A'$. Let $M$ denote $A$ or $A'$ without the $i$-th row, i.e. $M^TM = A^TA - vv^T = A'^TA' - v'v'^T$.
Therefore, denoting $\sigma_{\min}(M)$ and $\sigma_{\min}(A)$ as the least singular value of $M$ and $A$ resp., we have that
$\sigma^2_{\min}(M) \le \sigma^2_{\min}(A) \le \sigma^2_{\min}(M) + B^2$. Same holds for the least singular value of $M$ and $A'$.

Recall that

$$\mathrm{PDF}_{\mathcal{W}_d(V,\nu)}(N) \propto \det(N)^{\frac{\nu-d-1}{2}}\exp\left(-\tfrac12\mathrm{tr}(V^{-1}N)\right)$$

We argue that Wishart-matrix additive noise is $(\epsilon,\delta)$-differentially private, using the explicit formulation of
the PDF. For the time being, we ignore the issue of outputting a matrix $X$ s.t. either $X - A^TA$, $X - A'^TA'$
or $X - M^TM$ are non-invertible. (Note, if our input matrix is $A$, then $\Pr[X - A^TA \text{ non invertible}] =
\Pr_{N\sim\mathcal{W}_d(V,\nu)}[N \text{ non invertible}] = 0$. However, it is not a-priori clear why we should also have
$\Pr[X - A'^TA' \text{ non invertible}] = 0$ or $\Pr[X - M^TM \text{ non invertible}] = 0$.) Later, we justify why such events can be
ignored. We now bound the appropriate ratios. If we denote the output of the mechanism as a matrix $X$,
then we compare
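$$\begin{aligned}
\frac{\mathrm{PDF}_{\mathcal{W}_d(V,\nu)}(X - A^TA)}{\mathrm{PDF}_{\mathcal{W}_d(V,\nu)}(X - A'^TA')} &= \left(\frac{\det(X - A^TA)}{\det(X - A'^TA')}\right)^{\frac{\nu-d-1}{2}} e^{-\frac12\left(\mathrm{tr}(V^{-1}(X - A^TA)) - \mathrm{tr}(V^{-1}(X - A'^TA'))\right)} \\
&= \left(\frac{\det(X - M^TM - vv^T)}{\det(X - M^TM - v'v'^T)}\right)^{\frac{\nu-d-1}{2}} e^{-\frac12\mathrm{tr}\left(V^{-1}(X - A^TA - X + A'^TA')\right)} \\
&= \left(\frac{1 - v^T(X - M^TM)^{-1}v}{1 - v'^T(X - M^TM)^{-1}v'}\right)^{\frac{\nu-d-1}{2}} \exp\left(-\tfrac12\mathrm{tr}(V^{-1}v'v'^T) + \tfrac12\mathrm{tr}(V^{-1}vv^T)\right) \\
&= \left(\frac{1 - v^T(X - M^TM)^{-1}v}{1 - v'^T(X - M^TM)^{-1}v'}\right)^{\frac{\nu-d-1}{2}} \exp\left(-\tfrac12 v'^TV^{-1}v' + \tfrac12 v^TV^{-1}v\right)
\end{aligned}$$

where the last step uses $\mathrm{tr}(AB) = \mathrm{tr}(BA)$.
We can now use the inequality $\frac{1-x}{1-y} = (1-x)\left(1 + \frac{y}{1-y}\right) \le \exp\left(-x + \frac{y}{1-y}\right)$ for any $x$ and any $y < 1$ to deduce

$$\ln\left(\frac{\mathrm{PDF}_{A^TA+N}(X)}{\mathrm{PDF}_{A'^TA'+N}(X)}\right) \le \frac12\cdot v^T\left(V^{-1} - (\nu-d-1)(X - M^TM)^{-1}\right)v - \frac12\cdot v'^T\left(V^{-1} - \frac{\nu-d-1}{1 - v'^T(X - M^TM)^{-1}v'}(X - M^TM)^{-1}\right)v'$$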


Note that we either have X — M = X — A^A + vv~^ = N + vv~^ or X — M = N + v'v'^. And so, 
we continue assuming X was sampled using A^A, but the case X was sampled from A!^A is symmetric. 
Further, we only show a bound for the first term of the two above, as the other term will have the same 
upper bound. 


Note that (A - M^MA = A- = A- A'^AA - ^ fiance 


v'^(X - M'^MAv = v'^iX - A'^AAv - 
v'^{X - M^MAv' = v'^{X - A^AAv' - 


{v'^{X — A^A) 

1 + {X - A'^AAv 

A'^iX-A'^AAA 

1 + v~^{X — A^A)~^v 


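And so we have:

$$\begin{aligned}
v^T\left(V^{-1} - (\nu-d-1)(X - M^TM)^{-1}\right)v
&= v^T\left(V^{-1} - (\nu-d+1)(X - M^TM)^{-1}\right)v + 2v^T(X - M^TM)^{-1}v \\
&\le v^T\left(V^{-1} - (\nu-d+1)(X - A^TA)^{-1}\right)v + 2v^T(X - A^TA)^{-1}v + \frac{(\nu-d+1)\left(v^T(X - A^TA)^{-1}v\right)^2}{1 + v^T(X - A^TA)^{-1}v}
\end{aligned}$$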


Now, note that $(X - A^TA) \sim W_d(V,\nu)$, and so $V^{-1/2}(X - A^TA)V^{-1/2} \sim W_d(I_{d\times d},\nu)$. This allows us to
invoke Lemma |A]T] to 


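$$v^T\left(V^{-1} - (\nu-d+1)(X - A^TA)^{-1}\right)v = \left(V^{-1/2}v\right)^T\left(I - \left(\frac{V^{-1/2}(X - A^TA)V^{-1/2}}{\nu-d+1}\right)^{-1}\right)\left(V^{-1/2}v\right)$$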
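and infer that w.p. $\ge 1 - \delta/2$ we have the following bound

$$\begin{aligned}
v^T\left(V^{-1} - (\nu-d-1)(X - M^TM)^{-1}\right)v
&\le \left(\frac{2\sqrt{2(\nu-d+1)\ln(4/\delta)} - 2\ln(4/\delta)}{\left(\sqrt{\nu-d+1} - \sqrt{2\ln(4/\delta)}\right)^2} + \frac{2}{\left(\sqrt{\nu-d+1} - \sqrt{2\ln(4/\delta)}\right)^2} + \frac{(\nu-d+1)\,\|V^{-1/2}v\|^2}{\left(\sqrt{\nu-d+1} - \sqrt{2\ln(4/\delta)}\right)^4}\right)\|V^{-1/2}v\|^2 \\
&= \left(\frac{2\sqrt{2(\nu-d+1)\ln(4/\delta)} - 2\ln(4/\delta) + 2}{\left(\sqrt{\nu-d+1} - \sqrt{2\ln(4/\delta)}\right)^2} + \frac{(\nu-d+1)\,\|V^{-1/2}v\|^2}{\left(\sqrt{\nu-d+1} - \sqrt{2\ln(4/\delta)}\right)^4}\right)\|V^{-1/2}v\|^2 \\
&\le \frac{2\sqrt{2(\nu-d+1)\ln(4/\delta)} - 2\ln(4/\delta) + 6}{\left(\sqrt{\nu-d+1} - \sqrt{2\ln(4/\delta)}\right)^2}\,\|V^{-1/2}v\|^2
\end{aligned}$$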

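Analogously, w.p. $\ge 1 - \delta/2$ the following bound holds as well:

$$\begin{aligned}
v'^T\left(\frac{\nu-d-1}{1 - v'^T(X - M^TM)^{-1}v'}\,(X - M^TM)^{-1} - V^{-1}\right)v'
&\le v'^T\left((\nu-d+1)(X - M^TM)^{-1} - V^{-1}\right)v' + \frac{(\nu-d-1)\left(v'^T(X - M^TM)^{-1}v'\right)^2}{1 - v'^T(X - M^TM)^{-1}v'} \\
&\le v'^T\left((\nu-d+1)(X - A^TA)^{-1} - V^{-1}\right)v' + \frac{(\nu-d-1)\left(v'^T(X - A^TA)^{-1}v'\right)^2}{1 - v'^T(X - M^TM)^{-1}v'} \\
&\le \left(\frac{2\sqrt{2(\nu-d+1)\ln(4/\delta)} - 2\ln(4/\delta)}{\left(\sqrt{\nu-d+1} - \sqrt{2\ln(4/\delta)}\right)^2} + \frac{(\nu-d-1)\,\|V^{-1/2}v'\|^2}{\left(1 - v'^T(X - M^TM)^{-1}v'\right)\left(\sqrt{\nu-d+1} - \sqrt{2\ln(4/\delta)}\right)^4}\right)\|V^{-1/2}v'\|^2
\end{aligned}$$

Combining the two upper bounds we get

$$\begin{aligned}
\ln\left(\frac{\mathrm{PDF}_{A^TA+N}(X)}{\mathrm{PDF}_{A'^TA'+N}(X)}\right)
&\le \tfrac{1}{2}\, v^T\left(V^{-1} - (\nu-d-1)(X - M^TM)^{-1}\right)v + \tfrac{1}{2}\, v'^T\left(\frac{\nu-d-1}{1 - v'^T(X - M^TM)^{-1}v'}\,(X - M^TM)^{-1} - V^{-1}\right)v' \\
&\le \frac{2\sqrt{2(\nu-d+1)\ln(4/\delta)} - 2\ln(4/\delta) + 6}{\left(\sqrt{\nu-d+1} - \sqrt{2\ln(4/\delta)}\right)^2}\cdot\frac{\|V^{-1/2}v\|^2 + \|V^{-1/2}v'\|^2}{2} \\
&\le \frac{B^2}{\sigma_{\min}(V)}\cdot\frac{2\sqrt{2(\nu-d+1)\ln(4/\delta)}}{\left(\sqrt{\nu-d+1} - \sqrt{2\ln(4/\delta)}\right)^2}
\end{aligned}$$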

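All we now need to do is to plug in the fact that $\nu = \lfloor d + C_2\cdot 2\ln(4/\delta)\rfloor \ge d - 1 + C_2\cdot 2\ln(4/\delta)$, and that $\sigma_{\min}(V) \ge C_1$ to deduce

$$\ln\left(\frac{\mathrm{PDF}_{A^TA+N}(X)}{\mathrm{PDF}_{A'^TA'+N}(X)}\right) \le \frac{B^2}{C_1}\cdot\frac{2\cdot 2\ln(4/\delta)\cdot\sqrt{C_2}}{\left(\sqrt{C_2\cdot 2\ln(4/\delta)} - \sqrt{2\ln(4/\delta)}\right)^2} = \frac{2B^2\sqrt{C_2}}{C_1\left(\sqrt{C_2} - 1\right)^2} \le \epsilon$$

□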
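To make the mechanism analyzed in this proof concrete, the following is a minimal sketch of additive Wishart noise: it outputs $A^TA + N$ with $N \sim W_d(V,\nu)$. The function name `additive_wishart_noise` and the particular values of `V` and `nu` are illustrative assumptions only; they are not the calibrated parameters required by the theorem, so the sketch shows the sampling step and makes no privacy claim.

```python
import numpy as np

def additive_wishart_noise(A, V, nu, rng):
    """Return A^T A + N with N ~ W_d(V, nu), built from nu Gaussian rows."""
    d = A.shape[1]
    L = np.linalg.cholesky(V)            # V = L L^T
    Z = rng.standard_normal((nu, d))     # rows i.i.d. N(0, I_d)
    G = Z @ L.T                          # rows i.i.d. N(0, V)
    N = G.T @ G                          # N ~ W_d(V, nu)
    return A.T @ A + N

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 3))
V = 2.0 * np.eye(3)   # hypothetical scale matrix, NOT the theorem's calibration
nu = 10               # hypothetical degrees of freedom (nu >= d)
X = additive_wishart_noise(A, V, nu, rng)
noise = X - A.T @ A   # positive definite almost surely when nu >= d
```

Because the noise is positive definite, the output is itself positive definite, which is the property that distinguishes this mechanism from adding a symmetric Gaussian matrix.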


B.3 Privacy Proof for Algorithm 


Theorem B.4. Fix $\epsilon > 0$ and $\delta \in (0, 1/e)$. Fix $B > 0$. Let $A$ be a $(n \times d)$-matrix and fix an integer $\nu \ge d$.
Let $w$ be such that

52 


w^ = 


e(l- 


21n(4/i5) 


(2\/2i/ln(4/d) + 2 ln(4/d) 
Then, given that $\sigma_{\min}(A) \ge w$, the algorithm that samples a matrix from $W_d^{-1}(A^TA, \nu)$ is $(\epsilon,\delta)$-differentially private.


We comment on the similarity between the bounds of Theorem B.1 and Theorem B.4. This is after all quite natural, since the JL-theorem is a way to sample from a Wishart distribution $W_d(A^TA, r)$ (since every row in the matrix $RA$ is an i.i.d sample from $\mathcal{N}(0, A^TA)$). Clearly, one can sample a matrix from $W_d(A^TA, \nu)$ and invert it, to get a sample from $W_d^{-1}((A^TA)^{-1}, \nu)$ and vice-versa. Therefore, we get similar bounds. The only slight difference lies in the fact that we require in Theorem B.4 that $\nu \ge d$, s.t. the matrix we sample is indeed invertible, whereas we do not require any such lower bound for sampling from $W_d(A^TA, r)$.
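The correspondence described above can be illustrated with a short simulation. This is a sketch under illustrative assumptions (the sizes `n`, `d`, `r`, the number of trials, and the helper name `jl_wishart_sample` are all hypothetical): projecting $A$ with an $(r \times n)$ Gaussian matrix $R$ and forming $(RA)^T(RA)$ gives a draw whose empirical mean should approach $r \cdot A^TA$, the mean of $W_d(A^TA, r)$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, r = 30, 3, 5
A = rng.standard_normal((n, d))
gram = A.T @ A

def jl_wishart_sample(A, r, rng):
    R = rng.standard_normal((r, A.shape[0]))  # r x n matrix of i.i.d. N(0, 1)
    RA = R @ A                                # each row is a sample from N(0, A^T A)
    return RA.T @ RA                          # distributed as W_d(A^T A, r)

# Empirical mean of many draws versus the Wishart mean r * A^T A.
trials = 4000
mean_W = sum(jl_wishart_sample(A, r, rng) for _ in range(trials)) / trials
rel_err = np.linalg.norm(mean_W - r * gram) / np.linalg.norm(r * gram)

# With r >= d the draw is invertible almost surely; inverting it gives
# a draw from the corresponding inverse-Wishart distribution.
W = jl_wishart_sample(A, r, rng)
W_inv = np.linalg.inv(W)
```

The inversion step is only available when the number of projected rows is at least $d$, which mirrors the lower bound on $\nu$ discussed in the comment above.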


Proof. As always, we denote $A'$ as a neighbor of $A$ that differs just on a single row, which we denote $v$ for $A$ and $v'$ for $A'$, and as before, the matrix $M$ is the matrix $A$ with the $i$-th row all zeroed out. Therefore, $A^TA - vv^T = A'^TA' - v'v'^T = M^TM$. So, denoting $\sigma_{\min}(M)$ and $\sigma_{\min}(A)$ as the least singular values of $M$ and $A$ resp., we have that $\sigma_{\min}^2(M) \le \sigma_{\min}^2(A) \le \sigma_{\min}^2(M) + B^2$. The same holds for the least singular values of $M$ and $A'$.

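Recall that

$$\mathrm{PDF}_{W_d^{-1}(A^TA,\nu)}(X) \propto \det(A^TA)^{\frac{\nu}{2}} \det(X)^{-\frac{\nu+d+1}{2}} \exp\left(-\tfrac{1}{2}\mathrm{tr}\left((A^TA)X^{-1}\right)\right)$$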

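We invoke the determinant update lemma, the Sherman-Morrison lemma and the inequality $\frac{1+x}{1+y} \le \exp\left(x - \frac{y}{1+y}\right)$ yet again to deduce:

$$\begin{aligned}
\frac{\mathrm{PDF}_{W_d^{-1}(A^TA,\nu)}(X)}{\mathrm{PDF}_{W_d^{-1}(A'^TA',\nu)}(X)}
&= \frac{\det(A^TA)^{\nu/2}\exp\left(-\tfrac{1}{2}\mathrm{tr}\left((A^TA)X^{-1}\right)\right)}{\det(A'^TA')^{\nu/2}\exp\left(-\tfrac{1}{2}\mathrm{tr}\left((A'^TA')X^{-1}\right)\right)} \\
&= \left(\frac{1 + v^T(M^TM)^{-1}v}{1 + v'^T(M^TM)^{-1}v'}\right)^{\nu/2}\exp\left(-\tfrac{1}{2}\mathrm{tr}\left((A^TA - A'^TA')X^{-1}\right)\right) \\
&\le \exp\left(\tfrac{\nu}{2}\,v^T(M^TM)^{-1}v - \tfrac{\nu}{2}\cdot\frac{v'^T(M^TM)^{-1}v'}{1 + v'^T(M^TM)^{-1}v'} - \tfrac{1}{2}v^TX^{-1}v + \tfrac{1}{2}v'^TX^{-1}v'\right) \\
&= \exp\left(\tfrac{1}{2}\,v^T\left(\nu(M^TM)^{-1} - X^{-1}\right)v - \tfrac{1}{2}\,v'^T\left(\frac{\nu(M^TM)^{-1}}{1 + v'^T(M^TM)^{-1}v'} - X^{-1}\right)v'\right)
\end{aligned}$$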

We continue assuming $X \sim W_d^{-1}(A^TA,\nu)$ (the case $X \sim W_d^{-1}(A'^TA',\nu)$ is symmetric). By definition, we have that $X^{-1} \sim W_d((A^TA)^{-1},\nu)$. Hence $(A^TA)^{1/2}X^{-1}(A^TA)^{1/2} \sim W_d(I_{d\times d},\nu)$, which implies that the distribution of $(A^TA)^{1/2}X^{-1}(A^TA)^{1/2}$ is the same as generating a $(\nu \times d)$-matrix of i.i.d $\mathcal{N}(0,1)$ samples and taking its cross-product with itself.

We continue using the Sherman-Morrison formula, and derive the bound 


- X-^) v = v^ {v{A'^A)-^ - X-^) v - 


V ■ [v^{AA) 


1 — {A^ A)~^v 

< {{A^A)-^/^vf (vidxd - {A^Af/^X-\A^A)^/^) {{A^A)-^l^v) 

< ||(ATA)-i/2^,||2 {2^j2v\a{A/5) -F 21n( 
which holds w.p. $\ge 1 - \delta/2$ due to Lemma A.1. Similarly, we have


— V 


/T 


- (.(M-M)- - X-') 

^ \ — v'^ {M^ M)~^v' 


= -v''^ {v{A^A)-^ - X-^) v' + 

< -v'^ {v{A^A)-^ - X-^) v' + 

< 


, v ■ {y'^{M^M) ^v'Y V ■ {v'^{AAA) 


1 — v'^ {M^ M)~'^v' 1 — v'^ {A^ A)~^v' 

, V ■ (v'~^{A^A)~^v)^ V ■ {v'~^{A^A)~^vy' 


1 — v'~^ {M~^ M)~^v' 1 — v'~^ {AA A)~^v' 

II {A^A)-^/^v'f (2^2v\n{A/5) + 21n(4/(5)) 

+ 1/ • \\{A^A)-^/^v'f\\{A^A)-^/^v\\^ 


1 — v'~^ {M~^ A'I)~^v' 1 — v'~^ {A^ A)~^v' 

Denoting the least singular value of (AAA) as and using the fact that ||i;||, ||i)'|| < B and crudely upper 
bounding v'~^{M~^M)~^v' and v'~^{AAA)~^v' by ^ we get 


In 


As we have uA = —- — 


PDFw-bATA,.)W \ < 1 2 . A (2v/2^1n(4/<5) + 21n(4/5)) + 1 • :^(4^ + Au) 
(^^2v\n{A/5) + 21n(4/5)^ we get that 


2ln{i/S)> 


PP’Pyvyb-^'TA',!/) (^) ^ 


)) + 


Av 


2\n(i/S))^ Sv\Yi{A/5) 


< e 


□ 


B.4 An Additional Privacy Theorem — Gaussian Noise for the Inverse 

Theorem B.5. Fix $\epsilon \in (0,1)$ and $\delta \in (0, e^{-1})$. Let $A$ be a $(n \times d)$-matrix where the $\ell_2$-norm of each row is bounded by $B$, where it is publicly known that $\sigma_{\min}(A^TA)/B^2 \ge 1 + \rho$ with $\rho > 0$. Then the algorithm that
outputs $(A^TA)^{-1} + N$ where $N$ is a symmetric matrix with each entry on or above the main diagonal of $N$
is sampled i.i.d from Af ^0, ^ ^ is (e, S)-differentially private. 

Proof. The proof of the theorem just bounds the $\ell_2$-global sensitivity of the inverse, using the Sherman-Morrison formula. We then use the fact that by independently adding noise to each entry in $(A^TA)^{-1}_{j,k}$ for

j < k where the noise is sampled i.i.d from Af ^0, GS'| • ^ ^°siVS) ^ 5)-differentially private. 

Denote $A$ and $A'$, two matrices that differ on a single row, which is denoted $v$ in $A$ and $v'$ in $A'$. Therefore, $A'^TA' = A^TA + v'v'^T - vv^T$, so Weyl's inequality gives that $|\sigma_{\min}(A^TA) - \sigma_{\min}(A'^TA')| \le B^2$. Denoting $M$ as the matrix we get by zeroing out the $i$-th row on $A$ or $A'$, we have


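$$(A^TA)^{-1} = (M^TM)^{-1} - \frac{(M^TM)^{-1}vv^T(M^TM)^{-1}}{1 + v^T(M^TM)^{-1}v}, \qquad (A'^TA')^{-1} = (M^TM)^{-1} - \frac{(M^TM)^{-1}v'v'^T(M^TM)^{-1}}{1 + v'^T(M^TM)^{-1}v'}$$

Hence,

$$(A'^TA')^{-1} - (A^TA)^{-1} = (M^TM)^{-1}\left(\frac{vv^T}{1 + v^T(M^TM)^{-1}v} - \frac{v'v'^T}{1 + v'^T(M^TM)^{-1}v'}\right)(M^TM)^{-1}$$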
Let $x_1, x_2, \ldots, x_d$ be the eigenvectors of $M^TM$, corresponding to the eigenvalues $\mu_1, \ldots, \mu_d$. Then, for any $j,k$ we have

$$x_j^T\left((A'^TA')^{-1} - (A^TA)^{-1}\right)x_k = \mu_j^{-1}\mu_k^{-1}\left(\frac{(v\cdot x_j)(v\cdot x_k)}{1 + v^T(M^TM)^{-1}v} - \frac{(v'\cdot x_j)(v'\cdot x_k)}{1 + v'^T(M^TM)^{-1}v'}\right)$$

Due to Weyl's inequality, $\sigma_{\min}(M^TM) \ge \sigma_{\min}(A^TA) - \|v\|^2 \ge \rho\cdot B^2$. And so, together with the inequality $(a - b)^2 \le 2a^2 + 2b^2$ we get


{A'^A')-^ - {A^A)-% = ^ (x/ ((A'Ta')-i - (A^A)-!) xk)^ 

i,fe=i 

_ _ 2 _ {v ■ Xj)'^{v ■ Xk)^ 

1 + ^ p]pl 

_2_ (v' ■ Xj)‘^(v' ■ Xk)^ 

- (om2 • ^k)^ + 

' 3,k 

' 5 ((?‘” ■ (?'” ^^ ) 

2||ur + 2||u'r 4 

(pB2)2 - (pB2)2 p2 


C Utility Theorems

In this section we provide the utility statement for the Analyze Gauss algorithm and the additive Wishart noise algorithm. Throughout this section we assume our database $D \in \mathbb{R}^{n\times d}$ is in fact composed of $D = [X; y]$ where $X \in \mathbb{R}^{n\times p}$ and $y \in \mathbb{R}^n$ (so we denote $p = d-1$). Clearly, to assume $y$ is the last column of $D$ simplifies the notation, but $y$ can be any single column of $D$ and $X$ can be any subset of the other columns of $D$.

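In this section we will repeatedly use the Woodbury formula, which states that for any invertible $A \in \mathbb{R}^{p\times p}$ and $U \in \mathbb{R}^{p\times k}$ and $V \in \mathbb{R}^{k\times p}$ of corresponding dimension we have

$$(A + UV)^{-1} = A^{-1} - A^{-1}U\left(I_{k\times k} + VA^{-1}U\right)^{-1}VA^{-1}$$

which implies that for any $B \in \mathbb{R}^{p\times p}$ we have the binomial inverse formula:

$$(A + B)^{-1} = A^{-1} - A^{-1}\left(I_{p\times p} + BA^{-1}\right)^{-1}BA^{-1} \qquad (3)$$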
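As a sanity check, both identities can be verified numerically in their standard form, in which the inner inverse carries a plus sign. The sketch below uses arbitrary small random matrices; the dimensions `p`, `k` and the scaling factors are illustrative assumptions chosen only so that all inverses are well conditioned.

```python
import numpy as np

rng = np.random.default_rng(3)
p, k = 5, 2
G = rng.standard_normal((p, p))
A = G @ G.T + p * np.eye(p)            # invertible (positive definite) p x p matrix
U = 0.3 * rng.standard_normal((p, k))  # small perturbation factors
V = 0.3 * rng.standard_normal((k, p))
A_inv = np.linalg.inv(A)

# Woodbury: (A + UV)^{-1} = A^{-1} - A^{-1} U (I_k + V A^{-1} U)^{-1} V A^{-1}
direct = np.linalg.inv(A + U @ V)
woodbury = A_inv - A_inv @ U @ np.linalg.inv(np.eye(k) + V @ A_inv @ U) @ V @ A_inv

# Binomial inverse: (A + B)^{-1} = A^{-1} - A^{-1} (I_p + B A^{-1})^{-1} B A^{-1}
B = 0.1 * rng.standard_normal((p, p))
direct_B = np.linalg.inv(A + B)
binomial = A_inv - A_inv @ np.linalg.inv(np.eye(p) + B @ A_inv) @ B @ A_inv
```

In both cases the directly computed inverse agrees with the formula up to floating-point error, which can be confirmed with `np.allclose(direct, woodbury)` and `np.allclose(direct_B, binomial)`.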


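Our goal is to compare the distance between our predictor and the predictor one gets without noise, i.e. $\beta = (X^TX)^{-1}X^Ty$. Since we release a matrix $\widetilde{D^TD}$ that approximates $D^TD$, we can decompose it into the $p \times p$ matrix $\widetilde{X^TX}$ and the $p$-dimensional vector $\widetilde{X^Ty}$ and compute $\tilde\beta = (\widetilde{X^TX})^{-1}\widetilde{X^Ty}$. We thus give bounds on

$$\|\tilde\beta - \beta\| = \left\|(\widetilde{X^TX})^{-1}\widetilde{X^Ty} - (X^TX)^{-1}X^Ty\right\|$$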


Our analysis presents utility analysis that depends on the input parameters. This is in contrast to 
previous works on DP ERM that give a uniform bound and obtain it via regularization of the problem. 
(This is natural, as for $X = 0_{n\times p}$ clearly $\beta$ is ill-defined unless we regularize the problem.)
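The decomposition of a released second-moment matrix into a predictor can be sketched as follows. The helper name `ols_from_second_moment`, the synthetic data, and the small symmetric perturbation `E` are all hypothetical; the perturbation merely stands in for a privately released matrix and is not calibrated to any privacy guarantee.

```python
import numpy as np

def ols_from_second_moment(M):
    """Split a (p+1) x (p+1) second-moment matrix of D = [X; y] and solve for the predictor."""
    XtX = M[:-1, :-1]     # p x p block, plays the role of X^T X
    Xty = M[:-1, -1]      # p-vector, plays the role of X^T y
    return np.linalg.solve(XtX, Xty)

rng = np.random.default_rng(4)
n, p = 200, 3
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)
D = np.column_stack([X, y])

beta = ols_from_second_moment(D.T @ D)          # noiseless predictor

# Hypothetical symmetric perturbation standing in for a privately released matrix.
E = 0.5 * rng.standard_normal((p + 1, p + 1))
noisy = D.T @ D + (E + E.T) / 2
beta_tilde = ols_from_second_moment(noisy)
err = np.linalg.norm(beta_tilde - beta)         # the distance bounded in this section
```

When the least singular value of $X^TX$ is large relative to the perturbation, `err` is small, which is the input-dependent behavior the utility theorems of this section quantify.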
Theorem C.1. Fix $X \in \mathbb{R}^{n\times p}$ and $y \in \mathbb{R}^n$ s.t. $X^TX$ is invertible. Fix $\eta \in (0,1)$ and $\nu \in (0,1/e)$. Denote $\widetilde{X^TX} = X^TX + N$ and $\widetilde{X^Ty} = X^Ty + n$ where each entry of $N$ and $n$ is sampled i.i.d from $\mathcal{N}(0,\sigma^2)$. Then, there exists some constant $C > 1$ s.t. if we have that $\eta\cdot\sigma_{\min}(X^TX) \ge C\sigma\sqrt{p\ln(1/\nu)}$, then w.p. $\ge 1 - \nu$ we have


f3-(3 


{X^X)-^X^y-{X'X)-^X'y 


<2r;/3+^ 


We comment that this is not precisely the same as the behavior of the “Analyze Gauss” algorithm. The difference lies in the fact that Analyze Gauss outputs $X^TX + M$ where $M$ is a symmetric matrix whose entries along and above the main diagonal are sampled i.i.d from a suitable $\mathcal{N}(0,\sigma^2)$. However, one can denote $M = \frac{1}{\sqrt{2}}(N + N^T)$ for a matrix $N$ whose entries are i.i.d samples from $\mathcal{N}(0,\sigma^2)$, and so the same result, up to a factor of $\sqrt{2}$, holds for Analyze Gauss.

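Proof. Plugging in (3) we get

$$\begin{aligned}
(\widetilde{X^TX})^{-1}\widetilde{X^Ty} &= \left(I_{p\times p} - (X^TX)^{-1}\left(I_{p\times p} + N(X^TX)^{-1}\right)^{-1}N\right)(X^TX)^{-1}X^Ty \\
&\quad + \left(I_{p\times p} - (X^TX)^{-1}\left(I_{p\times p} + N(X^TX)^{-1}\right)^{-1}N\right)(X^TX)^{-1}n
\end{aligned}$$

Denoting $Z = (X^TX)^{-1}\left(I_{p\times p} + N(X^TX)^{-1}\right)^{-1}N$, we derive a bound on $\left\|(\widetilde{X^TX})^{-1}\widetilde{X^Ty} - (X^TX)^{-1}X^Ty\right\|$ using bounds on $\|Z\|$, $\|I - Z\|$ and $\|n\|$.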

Standard bounds on a symmetric ensemble of Gaussians [Tao12] give that $\|N\| \le C\cdot\sigma\sqrt{p\log(1/\nu)}$ w.p. $\ge 1 - \nu/2$ for some suitable constant $C > 0$. Hence we have that $\|N\|\cdot\|(X^TX)^{-1}\| \le \eta$. Hence, all singular values of $N(X^TX)^{-1}$ are upper bounded in absolute value by $\eta$, and so all singular values of $I + N(X^TX)^{-1}$ lie in the range $[1-\eta, 1+\eta]$. This implies that $\|Z\| \le \frac{\eta}{1-\eta}$ and $\|I - Z\| \le 1 + \frac{\eta}{1-\eta}$. Next we note that $\|n\|^2 \sim \sigma^2\cdot\chi^2_p$, and so, w.p. $\ge 1 - \nu/2$ it holds that $\|n\| \le \sigma\left(\sqrt{p} + \sqrt{2\ln(2/\nu)}\right)$.


Thus, we get 


(3-f3 


< 


1 - y 


11^1! 


1 V2cr2 1n(2/i/) ^ y 


1-y 


,(ATA) 


1-y 


1^11 + 7^ 


Corollary C.2. Denote p = Jot the same constant C in Theorem 

we have 

2C..~. 1 


C.l 


□ 


if p > 2C then 


(3-13 


<-11^11 + - 
P P 


Proof. The proof follows from Theorem C.1 and the observation that we can flip the role of $X^TX$ and $\widetilde{X^TX}$ because the Gaussian distribution is symmetric. And so, we just use the notation $\rho = C/\eta$.
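□

Theorem C.3. Let $W \sim W_{p+1}(\sigma^2 I, k)$, and denote $N \in \mathbb{R}^{p\times p}$ and $n \in \mathbb{R}^p$ s.t. $W = \begin{pmatrix} N & n \\ n^T & * \end{pmatrix}$. Let $X \in \mathbb{R}^{n\times p}$ be a matrix s.t. $X^TX$ is invertible and let $y \in \mathbb{R}^n$, and such that there exists a $C > 2$ s.t. $\sigma_{\min}(X^TX) = C\cdot\sigma^2\left(\sqrt{k} + \sqrt{p} + \sqrt{2\ln(4/\nu)}\right)^2$. Denote $\widetilde{X^TX} = X^TX + N$ and $\widetilde{X^Ty} = X^Ty + n$. Then

$$\|\tilde\beta - \beta\| \le \frac{1}{C-1}\|\beta\| + \frac{\sigma^2(C-2)}{(C-1)\,\sigma_{\min}(X^TX)}\,\min\left\{2\sqrt{2kp\cdot\ln(4p/\nu)},\ \left(\sqrt{k} + \sqrt{p} + \sqrt{2\ln(4/\nu)}\right)^2\right\}$$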


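Proof. Because $\sigma^2 I$ is a diagonal matrix, standard results on the Wishart distribution give that $N \sim W_p(\sigma^2 I_{p\times p}, k)$. We therefore denote $R$ as a $(k \times p)$-matrix of i.i.d samples from a normal Gaussian $\mathcal{N}(0,1)$, and have $N = \sigma^2 R^TR$. The Woodbury formula gives that

$$(X^TX + N)^{-1} = (X^TX)^{-1} - \sigma^2 (X^TX)^{-1}R^T\left(I + \sigma^2 R(X^TX)^{-1}R^T\right)^{-1}R(X^TX)^{-1}$$

Denoting $Q = \sigma R(X^TX)^{-1/2}$ we get

$$(X^TX + N)^{-1} = (X^TX)^{-1} - (X^TX)^{-1/2}\left[Q^T(I + QQ^T)^{-1}Q\right](X^TX)^{-1/2}$$

Now, if we denote $Q = U\Lambda V^T$ where $Q$'s singular values are $\lambda_1, \ldots, \lambda_d$, we get $Q^T(I + QQ^T)^{-1}Q = V\cdot\mathrm{diag}\left(\frac{\lambda_i^2}{1+\lambda_i^2}\right)\cdot V^T = V\cdot\mathrm{diag}\left(\frac{1}{\lambda_i^{-2}+1}\right)\cdot V^T$. Note that $Q^TQ = \sigma^2 (X^TX)^{-1/2}R^TR(X^TX)^{-1/2}$, so due to Lemma A.3 we have $\lambda_1^2 = \sigma_{\max}(Q^TQ) \le \frac{\sigma^2\left(\sqrt{k} + \sqrt{p} + \sqrt{2\ln(4/\nu)}\right)^2}{\sigma_{\min}(X^TX)} = \frac{1}{C}$ w.p. $\ge 1 - \nu/2$. Since each eigenvalue satisfies $\frac{\lambda_i^2}{1+\lambda_i^2} \le \lambda_i^2 \le \frac{1}{C} \le \frac{1}{C-1}$, this means that w.p. $\ge 1 - \nu/2$ we have $\sigma_{\max}\left(Q^T(I + QQ^T)^{-1}Q\right) \le \frac{1}{C-1}$. And so we have that both (i)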
{X^X)-^ - {X^X + N)-^ ^ -^{X^X)-^ and (ii) {X^X + N)-^ ^ §E^(X~^X)-A 

Next we turn to bound $\|n\|$. One easy bound, given Lemma A.3, is to show that w.p. $\ge 1 - \nu/2$ it holds that

$$\|n\| \le \|We_d\| \le \|W\|\cdot 1 \le \sigma^2\left(\sqrt{k} + \sqrt{p} + \sqrt{2\ln(4/\nu)}\right)^2$$


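Alternatively we can derive the following bound. Each coordinate in $n$ is the result of the dot-product between the $j$-th column of $R$, denoted $r_j$, with the $d$-th column of $R$, denoted $r_d$. Each coordinate in $R$ is sampled i.i.d from $\mathcal{N}(0,\sigma^2)$. Next, we use the fact that for two independent Gaussians with the same variance $X, Y \sim \mathcal{N}(0,\sigma^2)$ it holds that $XY = \left(\tfrac{1}{2}(X+Y)\right)^2 - \left(\tfrac{1}{2}(X-Y)\right)^2$, where $\tfrac{1}{2}(X+Y)$ and $\tfrac{1}{2}(X-Y)$ are two independent Gaussians $\mathcal{N}(0,\sigma^2/2)$. And so $r_j\cdot r_d = Z_{j1} - Z_{j2}$ where $Z_{j1}, Z_{j2} \sim \frac{\sigma^2}{2}\cdot\chi^2_k$. Tail bounds for the $\chi^2$-distribution (see the corresponding claim in the appendix) give that w.p. $\ge 1 - \nu/2$ it holds that each coordinate of $n$ is bounded in absolute value by $\frac{\sigma^2}{2}\left(\sqrt{k} + \sqrt{2\ln(4p/\nu)}\right)^2 - \frac{\sigma^2}{2}\left(\sqrt{k} - \sqrt{2\ln(4p/\nu)}\right)^2 = 4\cdot\frac{\sigma^2}{2}\sqrt{2k\ln(4p/\nu)}$, which means $\|n\| \le 2\sigma^2\sqrt{2kp\cdot\ln(4p/\nu)}$.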


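Combining both bounds, we have that w.p. $\ge 1 - \nu$ it holds that

$$\beta - \tilde\beta = \left((X^TX)^{-1} - (X^TX + N)^{-1}\right)X^Ty - (X^TX + N)^{-1}n$$

$$\Rightarrow\ \|\beta - \tilde\beta\| \le \frac{1}{C-1}\left\|(X^TX)^{-1}X^Ty\right\| + \frac{C-2}{C-1}\left\|(X^TX)^{-1}\right\|\cdot 2\sigma^2\sqrt{2kp\cdot\ln(4p/\nu)} = \frac{1}{C-1}\|\beta\| + \frac{2\sigma^2(C-2)}{(C-1)\,\sigma_{\min}(X^TX)}\sqrt{2kp\cdot\ln(4p/\nu)}$$

or:

$$\|\beta - \tilde\beta\| \le \frac{1}{C-1}\|\beta\| + \frac{\sigma^2(C-2)}{(C-1)\,\sigma_{\min}(X^TX)}\left(\sqrt{k} + \sqrt{p} + \sqrt{2\ln(4/\nu)}\right)^2$$

□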


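This is where we need to use the fact that $X$ and $Y$ have the same variance. We have $\begin{pmatrix} X+Y \\ X-Y \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}\begin{pmatrix} X \\ Y \end{pmatrix}$, and so the variance of $\begin{pmatrix} X+Y \\ X-Y \end{pmatrix}$ is diagonal iff $X$ and $Y$ have the same variance.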

We conjecture that the true bound is a $\log(p)$-factor smaller, i.e. $O\left(\sigma^2\sqrt{2kp\cdot\ln(4/\nu)}\right)$.